Re: Latin digraph characters (was: Re: Klingon silliness)
Doug Ewell wrote: Aren't Serbian and Croatian the standard example of two "languages" that are really the same language but are treated separately This question about languages being "really" the same or not turns out to be a rather moot one from a linguist's viewpoint, even more so once the issue gets burdened with national feelings. I mean, are English and Scots the same? Are Bulgarian and Macedonian the same? Are Romanian and Aromanian the same? Are Ancient Greek and Ancient Macedonian the same? Are Upper German and Lower German the same? Are German, Schwitzerdytsch and Letzeburgsch the same? Are Dutch and Flemish the same? Are British and American English the same (that was an issue at one time!)? -- There are probably as many such issues as there are nations in the world, or more, and as a linguist you get weary of being asked what the "real" answer is in each case. Are there any linguistic or vocabulary differences between them? Well, there are bound to be, at some level, and if not in the normative standards, then in the actual spoken varieties of the relevant population centers. The question is just, how big are these, and--a different and much more important question--how big do people *want* to *perceive* these differences to be? Lukas (P.S.: Sorry Doug, I meant to send this to the list in the first place.)
RE: What about musical notation?
About 25 years ago, I and several friends set some music in our church publications (newsletters and handouts for the congregation) using transfer symbols and photocopying. The process is definitely not suitable for serious publication. I saw a 19th century American music book set from movable type in the Newark, NJ public library about 35 years ago. Naturally I have no idea of the title or publisher now. :( In addition to its inflexibility, movable type is not well suited to music because it is difficult to get lines to join smoothly. At 8:26 AM -0800 2/22/01, Figge, Donald wrote: About thirty years ago, I was involved in the production of a song book. At that time, the notes were engraved directly onto copper plates by artisans who specialized in music engraving. Repro proofs were made from the plates, and then the words were pasted onto the proofs. Don // -Original Message- From: William Overington [mailto:[EMAIL PROTECTED]] Sent: Thursday, February 22, 2001 3:53 AM To: Unicode List Subject: Re: What about musical notation? - - Does anyone know of any details of metal music type please? William Overington 22 February 2001 -- Edward Cherlin Generalist "A knot!" exclaimed Alice. "Oh, do let me help to undo it." Alice in Wonderland
Re: Latin digraph characters (was: Re: Klingon silliness)
Doug Ewell asked: Aren't Serbian and Croatian the standard example of two "languages" that are really the same language but are treated separately (a) for political reasons and (b) because Cyrillic is used to write the former and Latin to write the latter? Are there any linguistic or vocabulary differences between them? The matter is much more complicated here. Linguistically speaking, there is a South Slavonic dialect continuum from Slovenian to Bulgarian with no sharp language boundaries. There are, of course, many feature boundaries and isoglosses, as usual in dialect continua. Any national language is a construction (where the degree of constructedness varies considerably). Serbo-Croatian (as a single language) is essentially a 19th-century construction and became the national language of Yugoslavia after WW I. Serbian, Croatian, Bosnian (and maybe Montenegrin soon) are more recent constructions from before and after the split of Yugoslavia into parts. There is a lot of prescriptive language planning going on in order to make the three languages more different from each other. The national languages do not map onto the major dialect boundaries in the dialect continuum. If you can read German, I recommend the book by Detlev Blanke, Internationale Plansprachen, Akademie-Verlag Berlin, which contains lots of examples of how national languages contain planned elements. It proceeds with a survey of planned languages and Esperanto. Did you know that Slovak was reconstructed in the 19th century in order to make it more different from Czech? --J"org Knappen
Re: What about musical notation?
At 3:52 AM -0800 2/22/01, William Overington wrote: Having been advised recently about accessing 21 bit unicode characters using an example from musical notation, following up on that advice I have found the document that details characters in the range U+1d100 to U+1d1ff, entitled Musical Symbols. [snip] I find myself interested in the possibility that unicode could be used to encode as a sequence of characters a representation of the contents of the composing stick of a hand set metal type printer, including the various items of spacing material of which the viewer of a finished print is not aware. One application at present would be so that fine quality type set illustrations of music and mathematics could be produced by placing that sequence of codes in the param statement of a java applet in a web page. Does anyone know of any details of metal music type please? William Overington 22 February 2001 The TeX DVI output file format does something close to what you describe by putting items and expressions composed from basic characters into boxes, and specifying the location of each box. Both horizontal and vertical spacing are expressed in integer multiples of the basic unit, 1/65536 of a true printer's point (1/72.27 in.). Font sizes are also specified in the same unit. The TeX source format includes codes for the common typographical spaces, several more specialized math spaces, and a general concept of "glue" spaces with numeric stretch and shrink parameters, including three orders of infinite stretchability. Further spacing control is provided in tables. Most software that handles mathematical expressions can translate them to TeX. This includes high-end math software such as Mathematica, technical publishing applications, notably FrameMaker, and ordinary word processors with built-in expression editors. In some cases, the translation from a word processing format requires an external utility. 
I suggest, therefore, that writing a downloadable TeX DVI renderer plug-in for a Web browser is a more general long-term solution for your application. Most of the code you would need is available as open source in C. It would not surprise me if a DVI renderer in Java had been done somewhere, although I have not heard of one. There is a Unicode TeX, called Omega TeX, capable of handling any writing system in principle, and supporting a fair number of writing systems in practice. -- Edward Cherlin Generalist "A knot!" exclaimed Alice. "Oh, do let me help to undo it." Alice in Wonderland
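The unit arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not code from TeX or from any DVI renderer: the function names are made up for this note, and the only facts taken from the message are that the basic unit is 1/65536 of a printer's point and that a point is 1/72.27 in.

```python
from fractions import Fraction

UNITS_PER_POINT = 65536              # basic units in one printer's point
POINTS_PER_INCH = Fraction(7227, 100)  # 72.27 printer's points per inch


def points_to_units(points):
    """Convert a length in printer's points to integer basic units (rounded)."""
    return round(Fraction(points) * UNITS_PER_POINT)


def units_to_inches(units):
    """Convert integer basic units back to inches, as an exact fraction."""
    return Fraction(units, UNITS_PER_POINT) / POINTS_PER_INCH


if __name__ == "__main__":
    ten_point = points_to_units(10)
    print(ten_point)                          # size of a 10pt font in basic units
    print(float(units_to_inches(ten_point)))  # the same length in inches
```

Keeping every position as an integer multiple of such a small unit is what lets box locations and font sizes be recorded without rounding drift; the exact fractions here are just a convenient way to check the conversions.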
RE: Latin digraph characters (was: Re: Klingon silliness)
Doug Ewell wrote: Aren't Serbian and Croatian the standard example of two "languages" that are really the same language but are treated separately (a) for political reasons and (b) because Cyrillic is used to write the former and Latin to write the latter? Are there any linguistic or vocabulary differences between them? [OT: talking about languages, not scripts] I think that the difference between the two is comparable to the difference between British and American English. (Oops! I am a Latino, not an Anglo, so change the last 4 words to "Spanish and American Castilian" :-) I.e., relatively big differences in the colloquial languages, but just a handful of spelling and vocabulary variations in the unified literary language. I think, for instance, that "river" would be "rijeka" in Croatian and "reka" in Serbian ("ријека" and "река", in Cyrillic). [/OT] _ Marco
Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit
Tomas wrote: More importantly, Han \u6f22 (Cant. Hon) really isn't an ethnonym used by the Cantonese and other southern Chinese; rather, Tang \u5510 (Cant. Tong) is used instead, e.g., tangcan \u5510\u9910 'Chinese cuisine' (Cant. tongchaan), tanghua \u5510\u8a71 'Chinese (spoken) language' (Cant. tongwa), tangren \u5510\u4eba 'Chinese person' (Cant. tongyan), tangrenjie \u5510\u4eba\u8857 'Chinatown' (Cant. tongyangaai), tangshan \u5510\u5c71 'China (lit. "Tang mountain")' (Cant. tongsaan), etc. Some of these terms are kind of old-fashioned or rustic, though. True, but it would be a bit unfair, since other groups use the same ethnonym. If we're looking for a high register term for Cantonese ideographs, how about 'YuhtJih' [\u7cb5\u5b57] (Mand. Yuezi)? I think I heard of a tangzi \u5510\u5b57 (Cant. tongji) term once; this would be most ideal to make use of, if one wanted to invent new English terminology. But that still leaves the problem of distinguishing the "dialect" characters of other southern Chinese languages from the mainstream characters, and the Cantonese "dialectal" characters. Basically just linguistic transcription, like the recently-created Hong Kong-indigenous Jyutping \u7cb5\u62fc (Mand. Yuepin) system. Unlike some other Chinese languages, romanization (usu. introduced by missionaries) didn't catch on, and the dominant (and conservative) trend is to write in Han characters, even if that means having to create new ones, hijacking existing ones, or resurrecting old ones. Just for the sake of our sanity ; ) with the number of homophones we have, writing entirely in romanization is ... an interesting pastime. Believe me, I've tried ... and the Yale English Cantonese dictionary still drives me nuts for not having characters ... So apart from dictionaries you find romanization (a myriad of varieties) for the transcription of names and shops, place names ... and that's about it I would think.
I believe 'status' comes into this a lot - educated people "know how to write" ... Michael
Re: Klingon silliness
At 15:03 -0800 2001-02-27, David Starner wrote: Then why do you berate people for working on Klingon? They're probably no more an expert at the minority scripts than you are. One begs to differ. ;-) They're just working on what they find interesting and are knowledgeable about. I guess in a bit I'll have to say something about all this Klingon silliness. -- Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
Re: What about musical notation?
Edward Cherlin [EMAIL PROTECTED] writes: I suggest, therefore, that writing a downloadable TeX DVI renderer plug-in for a Web browser is a more general long-term solution for your application. Most of the code you would need is available as open source in C. It would not surprise me if a DVI renderer in Java had been done somewhere, although I have not heard of one. There's a Java DVI viewer available at: http://www.geom.umn.edu/java/idvi/ However, it's not free software (it's only for non-commercial use).
Re: Latin digraph characters (was: Re: Klingon silliness)
On Tue, Feb 27, 2001 at 08:38:04PM -0800, [EMAIL PROTECTED] wrote: Aren't Serbian and Croatian the standard example of two "languages" that are really the same language but are treated separately (a) for political reasons and (b) because Cyrillic is used to write the former and Latin to write the latter? Are there any linguistic or vocabulary differences between them? They are very similar, but there are subtle differences. The reasons are not just political but cultural and religious. The Serbians are mostly Orthodox, which is why they use Cyrillic. The Croatians are mostly Catholic, hence the use of the Roman alphabet. (It is a fairly precise rule that you can tell whether a Slavic nation has been historically Eastern Orthodox or Roman Catholic by which alphabet they use. Naturally, there have been other developments, e.g., there is a strong Lutheran minority in Slovakia, a strong Hussite tradition in Bohemia, etc. And, of course, you cannot assume that any individual is of a specific religion based on his nationality.) Adam -- Cogitans me cogito esse
Re: Latin digraph characters (was: Re: Klingon silliness)
On Wed, Feb 28, 2001 at 12:39:09AM -0800, J%ORG KNAPPEN wrote: Did you know, the Slovak was reconstructed in the 19th century in order to make it more different from czech? Not true. Written documents dating back to the Middle Ages clearly show that Slovak has been virtually unchanged since then. What you are talking about is the difference between Bernolák and Štúr, both of whom tried to codify Slovak in the 19th Century, that is to say, to create a set of formal rules (just as Webster did in America). Bernolák did it first. He used a Western Slovak dialect. His attempt failed, mostly because that dialect was not representative of the way most Slovaks speak. Štúr then attempted the same but using the Central Slovak dialect. He was very successful because that dialect was something all Slovaks could easily accept as the "official" language of Slovakia. Since Western Slovakia is closer to Bohemia (where the Czechs live), naturally, its dialect is closer to Czech (still quite distinct but closer) than the Central Slovak dialect. Because Bernolák's attempt predated Štúr's, I can see how it could give the impression of "reconstruction", but that is not what happened. Rather, Bernolák's attempt failed because, while it may have worked for the Slovak intelligentsia living in Bratislava (which is in Western Slovakia), it made no sense to the rest of the nation. Štúr was successful (even Bernolák accepted him) because he did it right. By the way, there was another attempt which preceded both. I cannot remember the author's name off the top of my head. This gentleman wrote what is now considered the first Slovak novel. He pretty much created his own Slovak based on his own theories of what Slovak should be. In the 20th Century someone translated that novel into real Slovak, so people can actually understand it.
All of this happened during the 19th Century period of nationalism (which arose not just in Slovakia) as a reaction to the attempt to have Slovaks assimilated into Hungary after the Austrian Empire turned into the Austro-Hungarian Empire and Hungarian became the official legal language of that part of the Empire (before that, Latin was the official language of the entire Empire and all nationalities used their own languages in day-to-day communications in their respective territories -- incidentally, Slovak has two very different words for Hungarian, one describing the entire territory of the Kingdom of Hungary, one describing the specific language of modern Hungary -- that means that a 19th Century Slovak could say, yes I am a Hungarian, as in someone from the Kingdom of Hungary, and at the same time, no, I am certainly not a Hungarian, as in a member of the nation living in modern Hungary -- the first word was Uhor, the second Maďar, quite different). The Czechs lived in the Austrian section of the Empire, the Slovaks in the Hungarian section. Back then no one would have suggested that Czech and Slovak were the same language (they have been distinct linguistically, culturally, and politically, ever since the Slavs moved to Europe). There was no need to "reconstruct" Slovak to make it more distinct from Czech. Cheers, Adam -- When two do the same, it's not the same -- Slovak proverb
Re: CJKV ideographic, was Re: Perception that Unicode is 16-bit
On Wednesday, February 28, 2001, at 01:54 AM, akerbeltz.alba wrote: So apart from dictionaries you find romanization (a myriad of varieties) for the transcription of names and shops, place names ... and that's about it I would think. I believe 'status' comes into this a lot - educated people "know how to write" ... Most place names in Hong Kong seem to be modified versions of Wade-Giles, which I (having learned Yale first) tend to find almost incomprehensible. Still, it was always fun to hear British news readers talk about what had just happened in, say, Sham Shui Po. = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jenkins/
utf-1.3 and utf-1.4
On http://www.atm.ch.cam.ac.uk/acmsu/utf/ I found the acronym utf used in a very different way from how Unicoders/ISO10646ers use it. Fortunately, there never was a utf-1.3 or utf-1.4 in our context. --J"org Knappen
Re: [OT] What is DEL for?
On Thu, 22 Feb 2001, [EMAIL PROTECTED] wrote: Otto Stolz [mailto:[EMAIL PROTECTED]] wrote: Dear Unicoders, again, I have inadvertently sent a contribution to a member rather than to the whole list, because the Unicode list sets the Reply-to header in a most inconvenient and unexpected manner. Here is a copy for the list. I hope I will not mistype the address. I really wish that I simply could use the reply-to-sender function of my MUA to answer to the Unicode list. ... Or maybe you need a mail client that allows you to apply a special rule to messages that come from this list such that any reply you send to a list message defaults to the address in the To: line rather than that in the From: or Reply To: line of the original message. I agree with you about the specific case of the Reply-To: header, but I think that it might be a good idea to change the list setup in other ways. For instance, the current setup seems to remove all In-Reply-To: and References: headers. This is a problem since it breaks the ability of my email program (Gnus) to do threading, for no particularly good reason. In fact, I believe that all headers except for those in a particular list are removed. Another annoyance is that no special headers are used to indicate that the message is in fact from this mailing list, so that you have to use the Sender: header etc. to do mail splitting. -- Gaute Strokkenes http://www.srcf.ucam.org/~gs234/ I'm using my X-RAY VISION to obtain a rare glimpse of the INNER WORKINGS of this POTATO!!
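To make the two complaints above concrete, here is a minimal Python sketch of what a mail client has to do with these headers. It is an assumption-laden illustration, not how Gnus or the list software works: the address in LIST_SENDER is a placeholder (the real list address is not shown in this archive), and both function names are invented for this note. It uses only the standard library's email parser.

```python
from email import message_from_string

# Placeholder address: the real list address does not appear in this archive.
LIST_SENDER = "unicode-list@example.org"


def is_list_mail(raw_message, list_sender=LIST_SENDER):
    """Mail splitting: with no dedicated list header, fall back on Sender:."""
    msg = message_from_string(raw_message)
    return list_sender.lower() in msg.get("Sender", "").lower()


def thread_parent(raw_message):
    """Threading: the parent is In-Reply-To:, else the last References: entry."""
    msg = message_from_string(raw_message)
    parent = msg.get("In-Reply-To")
    if parent:
        return parent.strip()
    references = msg.get("References", "").split()
    return references[-1] if references else None
```

If the list strips In-Reply-To: and References:, thread_parent returns None for every message, which is the threading breakage described above; a client is then left guessing from subject lines.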
Re: Klingon silliness
Michael Everson said: At 13:05 -0800 2001-02-27, Timothy Partridge wrote: How come the Klingons only have one language and script? :-) The victors successfully assimilated the conquered. -- Michael Everson Sure they can assimilate? I'm reliably informed that they only cling on. mg -- Marion Gunn Everson Gunn Teoranta http://www.egt.ie
Re: Latin digraph characters
Germans transliterate a single cyrillic letter with TSCH, shouldn't Unicode have also this tetragraph encoded? (ducking...) Is this the Cyrillic letter that is transliterated into English as CH pronounced as CH in church? There are in mathematics some polynomials called Chebyshev polynomials after a mathematician whose name was written in Cyrillic characters. I think that he was Russian, but I am not congruently certain of that. I remember seeing once that his name is sometimes expressed in roman characters as Chebyshev and sometimes in another way that I do not precisely remember and will not guess at but it began with the letter T. It is an interesting circumstance that Chebyshev polynomials are represented as y = T0(x) y = T1(x) y = T2(x) y = T3(x) and so on for all non-negative integers, where the number following the letter T should be written as a subscript and are not done so here because of the limitations of this email format. Perhaps there is some interesting footnote to this circumstance and that maybe the T refers to the mathematician's surname and that for some peculiarity of history of different routes being used at different times his surname was transliterated by one method for defining the functions and by another method for stating his name. Does anyone know whether there is any evidence of that being the case? William Overington 28 February 2001
RE: Latin digraph characters
William, I can only second your assumption for naming the Chebyshev polynomials Tx(), since the German transliteration is indeed Tschebyscheff (as the mathematician in me remembers...). FYI, there is one cyrillic character (U+0429: Щ) that is transliterated as SCHTSCH (in German), a few years ago there was one Nathan Tcharansky (Schtscharanski in German). getting a little [OT] now... Reinhard G. Handwerker, Sr. i18n Engineer Internet Security Systems
Re: Latin digraph characters
[utf-8] William Overington wrote: Germans transliterate a single cyrillic letter with TSCH, shouldn't Unicode have also this tetragraph encoded? (ducking...) Is this the Cyrillic letter that is transliterated into English as CH pronounced as CH in church? Yes. There are in mathematics some polynomials called Chebyshev polynomials after a mathematician whose name was written in Cyrillic characters. I think that he was Russian, but I am not congruently certain of that. Neither am I. But see below. I remember seeing once that his name is sometimes expressed in roman characters as Chebyshev and sometimes in another way that I do not precisely remember and will not guess at but it began with the letter T. Perhaps "Tchébicheff", i.e. the French way. It happens that the international way to spell Russian names is to use the French way of transliterating. This is important for passports, for example. The German way would be "Tschebyscheff", or something like that. Incidentally, I had a look at Don Knuth's site (he has a long list of non-Latin names of famous mathematicians), at http://Sunburn.Stanford.EDU/~knuth/help.html#exotic. And Don Knuth gave Chebyshev and Tschebyscheff, and not the French Tchébicheff. Also it appears that "Tchébicheff" is not very common on the web, much less so than "Tschebyscheff", which is itself much less common than Chebyshev (resp. 65, 1100 and 31000 in Google). Also, Don Knuth gives Пафнутий for his first name, which does not sound very Russian to me. So it might be that Chebyshev came from a region later dominated by Germany, hence had his name changed toward the German orthographic rules. Just a thought. It is an interesting circumstance that Chebyshev polynomials are represented as y = T0(x) Curiously, I do not remember using T, but rather the more usual P, to represent the Chebyshev polynomials while at school (but it is quite a long time ago, so I may easily recall incorrectly). Antoine
Re: Latin digraph characters
On Wed, Feb 28, 2001 at 11:19:37 -0800, Antoine Leca wrote: [utf-8] [koi8-r] ;-( I know I should upgrade my mailer. Also, Don Knuth gives Пафнутий for his first name, which does not sounds very Russian to me. It's Russian. Though, surely, not of Russian/Slavic origin. He was born on May 4 (Julian) in a village just a few miles away from a monastery famous for and named after the Russian saint of the same name. Incidentally, the Russian Orthodox Church celebrates the memory of this saint on May 1 (Julian). Hence the name choice is not that surprising given the time and the place of birth. http://users.kaluga.ru/school6/chebishev/Family.htm http://www.days.ru/Life/life964.htm SY, Uwe -- [EMAIL PROTECTED] | Zu Grunde kommen http://www.ptc.spbu.ru/~uwe/ | Ist zu Grunde gehen
UTF-8, C1 controls, and UNIX
The idea behind UTF-8 is to be able to use it in non-Unicode-aware UNIX versions: It lets you have Unicode filenames, Unicode directory names, Unicode file contents, Unicode email, etc. But what it does not do is let you *type* Unicode into regular UNIX applications or shells, if the UTF-8 happens to contain C1 control characters as do, for example, many of the Cyrillic letters (e.g. capital A through PE). Most UNIX terminal drivers treat incoming C1 controls like their C0 counterparts, so 0x83 == 0x03 == Ctrl-C, which interrupts whatever process you are talking to. Similarly 0x84 == Ctrl-D, which is EOF; 0x88 is backspace, and so on. I suppose this is a statement of the obvious, but now that I'm using a Unicode based terminal emulator with UTF-8 character set and trying to compose e-mail and netnews containing Russian words in a Telnet session to UNIX, the problem is suddenly concrete. We have said that UTF-8 is a kind of "transport form" that must be decoded prior to (e.g.) terminal escape sequences in the host-to-terminal direction. That's fine, the terminal emulator can (and does) do that. But in the other direction there is no such decoder on the UNIX end. The bare C1 octets are read by the UNIX terminal driver, which treats them as interrupt, suspend, xoff, tab, carriage return, linefeed, and all the rest. Here the model breaks down -- it is not symmetric. The nice thing about ISO 8859-1 was that it could be freely used in UNIX, in both directions, without UNIX knowing a thing about it. The same is not true for UTF-8. - Frank
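The byte-level effect described above is easy to check. The following Python sketch is only an illustration: the two function names are invented for this note, and the "drop the high bit" mapping is an assumption that models the driver behaviour the message reports (0x83 treated as 0x03), not the code of any particular UNIX terminal driver.

```python
def c1_bytes(text):
    """Return the UTF-8 bytes of text that fall in the C1 range 0x80..0x9F."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


def as_c0(byte):
    """Model a driver that treats a C1 byte like its C0 counterpart."""
    return byte & 0x7F


if __name__ == "__main__":
    # Cyrillic capital A, capital PE, and small U, EF, SHA.
    for ch in "\u0410\u041f\u0443\u0444\u0448":
        for b in c1_bytes(ch):
            control = as_c0(b)
            print("U+%04X: byte 0x%02X is seen as 0x%02X (Ctrl-%s)"
                  % (ord(ch), b, control, chr(control + 0x40)))
```

Capital A through PE (U+0410 to U+041F) encode as 0xD0 followed by 0x90 to 0x9F, so every one of them carries a C1 byte. Small U (U+0443) encodes as 0xD1 0x83, which such a driver would read as Ctrl-C.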
Re: Latin digraph characters
On Wed, 28 Feb 2001, Antoine Leca wrote: William Overington wrote: Germans transliterate a single cyrillic letter with TSCH, shouldn't Unicode have also this tetragraph encoded? (ducking...) Is this the Cyrillic letter that is transliterated into English as CH pronounced as CH in church? Yes. ... I remember seeing once that his name is sometimes expressed in roman characters as Chebyshev and sometimes in another way that I do not precisely remember and will not guess at but it began with the letter T. The initial character of the name is transliterated as CH in English, TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the official Russian transliteration. It's the same character as the first one in Chajkovskij, Chekhov... Perhaps "Tchébicheff", i.e. the French way. It happens that the international way to spell Russian names is to use the French way of translating. This is important for passports, for example. The Russian passports that I have seen (ok, only two) used the official transliteration, which is different from the French one. The French transliteration was popular once, as French was widely known among high-class Russians. In material printed nowadays in Russia I have always seen the official transliteration used, when needed. __ ||oka, Pierpaolo
Re: What about musical notation?
One application at present would be so that fine quality type set illustrations of music and mathematics could be produced by placing that sequence of codes in the param statement of a java applet in a web page. You may have a look at Lilypond, which is a free musical typesetting engine producing output for TeX (direct PS output is still experimental). http://www.cs.uu.nl/~hanwen/lilypond/index.html Werner
Re: UTF-8, C1 controls, and UNIX
Maybe one should make a transmission safe UTF that left C1 alone? Remember this? -- From: Markus Scherer [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Date: Mon, 10 Apr 2000 15:23:53 -0800 (GMT-0800) Subject: What if UTF-8 had been defined after UTF-16? What if UTF-8 had been defined just for the code point range 0..0x10ffff? What if UTF-8 had been designed to be not just "File-System-Safe" but also "Terminal-Safe"? UTF-8 could have had all the nice features that it has now, plus: - C1 control codes (0x80..0x9f) passed through as single bytes - no sequences longer than 4 bytes, BMP still covered with 3 bytes - no checking for code points >0x10ffff because it could have been designed just for that range - no minimum-length problem - no security concerns - all byte values used for some encoding It would have been possible. Interested? See http://www.mindspring.com/~markus.scherer/utf-8c1.html . Note: This is _not_ an approved UTF. I am _not_ proposing this as a new UTF. This is _not_ compatible with any existing UTF or other Unicode implementation. It is just a play with bits and bytes, a "what if", a "Gedankenexperiment". Just to share a thought - markus
Re: Question on Unicode data files
Mr Zhang is CEO of that company. Regards, Jianping. John Jenkins wrote: On Monday, February 26, 2001, at 09:12 PM, Richard Cook wrote: Is there any connection between this http://www.unihan.com.cn/ site and IRG? What is UniHan Digital Tech Co.? Their website has some rather annoying graphics and windows, but no basic info that i can see ... the bottom buttons don't work at all, no? I don't know who they are. They're not associated with the IRG that I'm aware. I'm checking with Mr. Zhang to see if he's heard of them. = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jenkins/
Re: UTF-8, C1 controls, and UNIX
On Wed, Feb 28, 2001 at 01:11:20PM -0800, Frank da Cruz wrote: The idea behind UTF-8 is to be able to use it in non-Unicode-aware UNIX versions: It lets you have Unicode filenames, Unicode directory names, Unicode file contents, Unicode email, etc. But what it does not do is let you *type* Unicode into regular UNIX applications or shells, if the UTF-8 happens to contain C1 control characters as do, for example, many of the Cyrillic letters (e.g. capital A through PE). Most UNIX terminal drivers treat incoming C1 controls like their C0 counterparts, so 0x83 == 0x03 == Ctrl-C, which interrupts whatever process you are talking to. Similarly 0x84 == Ctrl-D, which is EOF; 0x88 is backspace, and so on. Maybe one should make a transmission safe UTF that left C1 alone? keld
RE: Latin digraph characters
The initial character of the name is transliterated as CH in English, TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the official Russian transliteration. It's the same character as the first one in Chajkovskij, Chekhov... ...and as TS (the S with caron) in the Finnish transliteration. (Finnish uses S-caron and Z-caron, but only to transliterate Cyrillic and in certain loanwords that contain the same sound, such as "sakki" (s-caron there) for "chess".) When caron is not available, I think the allowed fallback is TSH.
Spacing diacritics in Greek Extended
As you know, in the short term any texts out there in Unicode polytonic Greek use precomposed characters, as people are not waiting for the intelligent font engines of the future. To put texts in Unicode, they convert them from existing codings. In all of these existing codings, be they 8-bit or ASCII-based (Beta Code), a capital letter with diacritics (titlecase) is rendered as two glyphs: the diacritics, as a spacing glyph, and then the capital. Since people have no familiarity with single-glyph capitals-with-diacritics, they are doing the same with their precomposed Unicode glyphs, using the spacing diacritics at the bottom of Greek Extended. See for example http://www.fordham.edu/halsall/basis/thomais-uni.html : the diacritics in section 5, at least, are separate glyphs. Unicode allows these spacing diacritic glyphs, but the Standard says that "unless information is present to the contrary", they should be interpreted as SPACE + non-spacing equivalent diacritic (Unicode 3.0, p.169-170). Would it be expedient to change this to having it postmodify the next character, as a legitimate legacy concern (which is why the precomposeds are there in the first place?) Fortunately the main online resource for converting into Unicode polytonic Greek (Sean Redmond's, http://www.jiffycomp.com/smr/unicode/convert.php3) is well-behaved in this regard. -- Nick Nicholas. TLG, UCI, USA. [EMAIL PROTECTED]; www.tlg.uci.edu/~opoudjis Many among their proselytes had sold their lands and houses to increase the public riches of the sect --- at the expense, indeed, of their unfortunate children, who found themselves beggars because their parents had been saints. (Edward Gibbon, _Decline and Fall_.)
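The interpretation quoted above from the Standard is visible in the character data itself. The sketch below uses Python's standard unicodedata module; the helper name legacy_equals_precomposed is made up for this note, and the psili-plus-Alpha pair is just one example of the legacy "spacing diacritic, then capital" pattern.

```python
import unicodedata

SPACING_PSILI = "\u1fbf"         # GREEK PSILI, a spacing diacritic in Greek Extended
CAPITAL_ALPHA = "\u0391"         # GREEK CAPITAL LETTER ALPHA
ALPHA_WITH_PSILI = "\u1f08"      # GREEK CAPITAL LETTER ALPHA WITH PSILI (precomposed)


def legacy_equals_precomposed(form):
    """Does the legacy two-character sequence normalize to the precomposed letter?"""
    legacy = SPACING_PSILI + CAPITAL_ALPHA
    return unicodedata.normalize(form, legacy) == ALPHA_WITH_PSILI


if __name__ == "__main__":
    # The spacing diacritic decomposes to SPACE + the non-spacing diacritic.
    print(unicodedata.decomposition(SPACING_PSILI))
    # The precomposed capital decomposes to the letter + the non-spacing diacritic.
    print(unicodedata.decomposition(ALPHA_WITH_PSILI))
    for form in ("NFC", "NFKC"):
        print(form, legacy_equals_precomposed(form))
```

Because the spacing psili is equivalent to SPACE plus COMBINING COMMA ABOVE, the combining mark attaches to the space before it and never to the capital after it, so no normalization form turns the legacy sequence into the precomposed capital. That is the gap the question above is asking about.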
ELL: Aboriginal Language TV Series
FYI

- Forwarded by Peter Constable/IntlAdmin/WCT on 02/28/2001 07:23 PM -

"Paul M. Rickard" [EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]och.edu.au
02/28/2001 04:58 PM
Please respond to endangered-languages-l
To: "[EMAIL PROTECTED]" [EMAIL PROTECTED]
cc:
Subject: ELL: Aboriginal Language TV Series

For those interested: Mushkeg Media Inc., a native-owned production company, is currently broadcasting a 13-part series on APTN (Aboriginal People's Television Network) on the state of Aboriginal languages in Canada. We are also planning season two of the series and are looking for any interesting and unique language revitalization programs or initiatives by individuals, communities or organizations across the land.

The series can be seen on APTN every Thursday at 2:30 pm and 11:30 pm EASTERN time. Listed below is the current schedule as seen on APTN. For more information about the series or any story ideas you can email: [EMAIL PROTECTED] or [EMAIL PROTECTED] Also check out www.aptn.ca for the schedule and the TV channel on cable in different parts of Canada. The series can also be picked up on Bell-Express Vu and StarChoice satellite dish for those in remote areas.

Meegwetch.

Paul M. Rickard
Mushkeg Media Inc.
103 Villeneauve West
Montreal, Quebec H2T 1 R6
[EMAIL PROTECTED]

APTN: FINDING OUR TALK: EPISODES STARTING THURSDAY FEB 1 AT 2:30 pm AND 11:30 pm EASTERN TIME

Episode 1 - Feb. 1: Language Among the Skywalkers: Mohawk: This is the story of the legendary Mohawk ironworkers, and of new approaches to language instruction for both adults and children within the contemporary community of Kahnawake.

Episode 2 - Feb. 8: Language Immersion: Cree: This episode will trace the history of the very successful Cree Language Immersion Program, developed and implemented in schools in the Cree communities of Northern Québec.

Episode 3 - Feb. 15: The Trees are Talking: Algonquin: George and Maggie Wabanonick take a group of teens to the woods to initiate them in their traditional culture and language. In the classroom, the kids and teachers struggle with their Algonquin lessons, while the pop group Anishnabe give the language new life.

Episode 4 - Feb. 22: The Power of Words: Inuktitut: At a language conference in Puvirnituq, we witness efforts to keep Inuktitut alive and up-to-date, largely through the knowledge and commitment of elders.

Episode 5 - March 1: Words Travel On Air: Attikamekw, Innu: Karin Awashish, a young radio journalist working at SOCAM, makes a trip to her home community to tape interviews and legends told by elders in Attikamekw, as part of the network's language initiative.

Episode 6 - March 8: Language in the City: Ojibwe/Anishinabe: This episode will focus on Isadore Toulouse's weekly trajectory to four different urban-based schools, where we witness first-hand, and with raw immediacy, his efforts to pass on his own enthusiasm and passion for the Ojibwe language.

Episode 7 - March 15: Getting Into Michif: Michif: We meet some of the movers and shakers working politically and through the education system to have Michif recognized as the official language of the Métis, as well as those whose passion and dedication are evidenced at the grass-roots level.

Episode 8 - March 22: Plains Talk: Saulteaux: This episode follows the work of a virtually self-taught, highly motivated language teacher. Stella Ketchemonia has devoted her life to teaching the Saulteaux language. She is now a member of the dynamic staff of the Saskatchewan Indian Federated College.

Episode 9 - March 29: Breaking New Ground: Mi'kmaw: This episode looks at two projects; a pilot to have Mi'kmaw adopted as an
Re: Spacing diacritics in Greek Extended
Nick Nicholas said:

As you know, in the short term any texts out there in Unicode polytonic Greek use precomposed characters, as people are not waiting for the intelligent font engines of the future. To put texts in Unicode, they convert them from existing codings. In all of these existing codings, be they 8-bit or ASCII-based (Beta Code), a capital letter with diacritics (titlecase) is rendered as two glyphs: the diacritics, as a spacing glyph, and then the capital. Since people have no familiarity with single-glyph capitals-with-diacritics, they are doing the same with their precomposed Unicode glyphs, using the spacing diacritics at the bottom of Greek Extended. See for example http://www.fordham.edu/halsall/basis/thomais-uni.html : the diacritics in section 5, at least, are separate glyphs. Unicode allows these spacing diacritic glyphs, but the Standard says that "unless information is present to the contrary", they should be interpreted as SPACE + non-spacing equivalent diacritic (Unicode 3.0, p.169-170). Would it be expedient to change this to having it postmodify the next character, as a legitimate legacy concern (which is why the precomposeds are there in the first place?)

No, if what you mean is a mechanical change of interpretation of such a sequence, so that the Unicode Standard would specify that:

1F0A (for example) = 1FCD, 0391 = 0391, 0313, 0300

The intermediate node of that equivalence would be totally out of whack for Unicode, formally, since it decomposes instead to:

0020, 0313, 0300, 0391

i.e., not the same as the recursive decomposition of 1F0A.

What the text on pp. 169-170 says, in full, is:

"Decomposition of [Greek Diacritic] Spacing Forms. When decomposing the spacing forms, the spacing status of the implied usage must be taken into account. Unless information is present to the contrary, these spacing forms would be decomposed to U+0020 SPACE followed by the nonspacing form equivalents shown in Table 7-2."
The exegesis of that passage is as follows. If you are simply decomposing text by a general algorithm, as for a Unicode Normalization Form (UAX #15), then you *must* use the normative decomposition mappings, as specified by that algorithm. I.e., 1FCD, 0391 normalized to NFKD is 0020, 0313, 0300, 0391 and nothing else.

However, if you have "information present to the contrary", as would be the case if you were doing intelligent conversion of polytonic Greek, then it is perfectly o.k. to take a Unicode representation of a compatibility sequence, i.e. 1FCD, 0391, perhaps obtained by a one-to-one mapping against an 8-bit implementation, and turn that into the preferred Unicode representation of polytonic Greek, i.e., 0391, 0313, 0300. This is a knowing transformation of the data from one form to another form by a process aware of these equivalences. But that is comparable, for example, to doing a transliteration from one form to another form, rather than being a built-in normative equivalence defined by the Unicode Standard itself.

Fortunately the main online resource for converting into Unicode polytonic Greek (Sean Redmond's, http://www.jiffycomp.com/smr/unicode/convert.php3) is well-behaved in this regard.

Good. I expect for these kinds of issues smart implementers ought to be able to "do the right thing". ;-)

--Ken
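For anyone who wants to see the two cases side by side, here is a minimal sketch in Python, using only the standard unicodedata module. It first shows the normative NFKD result for 1FCD, 0391, and then one possible "knowing transformation" that moves the marks onto the following letter. The helper names nonspacing_equivalent and attach_spacing_diacritics are made up for this illustration, and the rule "attach to the next letter" is an assumption about what a Greek-aware converter might do, not something the Standard or any particular converter specifies.

```python
import unicodedata

SPACE = "\u0020"

def nonspacing_equivalent(ch):
    """Return the combining marks behind a Greek Extended spacing diacritic.

    Hypothetical helper: it relies on the compatibility decomposition being
    SPACE followed by combining marks, and returns "" for anything else.
    """
    if not 0x1F00 <= ord(ch) <= 0x1FFF:
        return ""
    decomposed = unicodedata.normalize("NFKD", ch)
    marks = decomposed[1:]
    if decomposed.startswith(SPACE) and marks and all(
        unicodedata.combining(m) for m in marks
    ):
        return marks
    return ""

def attach_spacing_diacritics(text):
    """Move spacing diacritics onto the letter that follows them.

    Assumption for this sketch: a spacing diacritic directly before a letter
    is legacy data meaning "letter with these diacritics". If no letter
    follows, fall back to SPACE + marks, as the Standard's default does.
    """
    out = []
    pending = ""
    for ch in text:
        marks = nonspacing_equivalent(ch)
        if marks:
            pending += marks
        elif pending and unicodedata.category(ch).startswith("L"):
            out.append(ch + pending)
            pending = ""
        else:
            if pending:
                out.append(SPACE + pending)
                pending = ""
            out.append(ch)
    if pending:
        out.append(SPACE + pending)
    return "".join(out)

legacy = "\u1fcd\u0391"  # GREEK PSILI AND VARIA, then capital ALPHA

# Normative result: the marks stay on a SPACE in front of the capital.
assert unicodedata.normalize("NFKD", legacy) == "\u0020\u0313\u0300\u0391"

# Knowing transformation: the marks follow the capital, matching NFD of 1F0A.
converted = attach_spacing_diacritics(legacy)
assert converted == "\u0391\u0313\u0300"
assert converted == unicodedata.normalize("NFD", "\u1f0a")
assert unicodedata.normalize("NFC", converted) == "\u1f0a"
```

The point of keeping the two steps separate is the one made above: normalization is purely mechanical, while the reattachment is a conversion that has to be chosen deliberately by a process that knows it is handling polytonic Greek.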
RE: Handling UTF-8
Hi, What is your system architecture? A lot depends on it. BTW, you could check www.unicode.org and go to Useful Resources. I think it might have a few tools and libraries that could be helpful. Hope this helps. Thejo -- From: Polykarpos Karamaoynas Sent: Tuesday, February 27, 2001 7:02 PM To: Unicode List Subject: Handling UTF-8 Hello to everybody, I'd like to know if there is any available library which handles UTF-8 strings. I intend to make a non-graphic application interface which takes Unicode strings as input, handles them, and communicates with a database by querying or updating it in UTF-8 format. Is there anything available that could help me with this? Thank you in advance. Polykarpos
Re: UTF-8, C1 controls, and UNIX
In a message dated 2001-02-28 15:13:02 Pacific Standard Time, [EMAIL PROTECTED] writes: Maybe one should make a transmission safe UTF that left C1 alone? Remember this? -- From: Markus Scherer [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Date: Mon, 10 Apr 2000 15:23:53 -0800 (GMT-0800) Subject: What if UTF-8 had been defined after UTF-16? What if UTF-8 had been defined just for the code point range 0..0x10? What if UTF-8 had been designed to be not just "File-System-Safe" but also "Terminal-Safe"? Keld may have been referring to his own "UTF-7d5", described at http://www.uni-mainz.de/~knappen/jk009.html. Like UTF-8, it can express basic characters in no more than 3 code units, but unlike UTF-8 it requires the additional layer of UTF-16 to express supplementary characters (so they take 6 code units). UTF-1, the original UTF, was also designed not to use C0 or C1 bytes, or space or DEL, except to represent themselves. However, apparently the "slash" issue was deemed more critical than avoiding C1 bytes. -Doug Ewell Fullerton, California
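To make the C1 concern concrete, here is a small Python sketch that lists which bytes of a string's UTF-8 encoding fall in the C1 control range 0x80..0x9F. The function name c1_bytes is made up for this illustration; the only library call is the built-in str.encode.

```python
def c1_bytes(text):
    """Return the bytes of the UTF-8 encoding of text that lie in 0x80..0x9F."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]

# ASCII is encoded as itself, so no C1 bytes appear.
assert c1_bytes("A") == []

# U+00E9 encodes as C3 A9: the trail byte is above the C1 range.
assert c1_bytes("\u00e9") == []

# U+00C0 encodes as C3 80: the trail byte 0x80 is a C1 byte.
assert c1_bytes("\u00c0") == [0x80]

# U+20AC encodes as E2 82 AC: the middle byte 0x82 is a C1 byte.
assert c1_bytes("\u20ac") == [0x82]
```

Since UTF-8 trail bytes run over 0x80..0xBF, a terminal that interprets raw C1 bytes can see them in ordinary non-ASCII text, which is the situation a "Terminal-Safe" UTF would have to avoid.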
Re: Latin digraph characters
In a message dated 2001-02-28 14:13:23 Pacific Standard Time, [EMAIL PROTECTED] writes: The initial character of the name is transliterated as CH in English, TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the official Russian transliteration. It's the same character as the first one in Chajkovskij, Chekhov... Oddly enough, however, the great romantic composer's name is generally rendered "Tchaikovsky" in English, with the "tch" combination that is not used in any other English transliterations of Russian names. In olden days the German or German-like spelling "Tschaikowsky" was more common. -Doug Ewell Fullerton, California