Re: [unicode] Re: conformance for unicode 2.x?
At 11:54 AM 6/6/2003 -0700, Mark Davis wrote: We never put the 2.0 Standard online.

Any particular reason why someone (I volunteer if no one else does) can't type in the conformance section (it is plain English text if it is like the 3.0 chapter 3)? Also I would be curious how many copies of this extant and extinct book were actually sold, so I know how hard it will be to find a copy. OK, I know *I* can find a copy at a colleague's and have them photocopy the relevant section for me, but what about the rank and file developer? Is it a good thing for this standard to exist only in a long out-of-print book? RFCs, even the obsoleted ones, live forever online in numerous archives... shouldn't Unicode strive for the same immortality? I am asking about v2 for a selfish reason, but everything above might as well be about v1 also. Barry

We do, of course, keep copies in our office, but your best bet is to borrow one from a colleague. Mark __ http://www.macchiato.com ► “Eppur si muove” ◄

- Original Message - From: Barry Caplan [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Friday, June 06, 2003 10:14 Subject: Re: conformance for unicode 2.x?

Thanks Mark, but I had done all that online searching before I posted to the list. Is the book (which I no longer have a copy of) the only place where the details of conformance for 2.x are archived? If so, is that a good idea? Barry

At 11:09 AM 6/5/2003 -0700, Mark Davis wrote: If you start on http://www.unicode.org/ and click on Start Here, you'll get to a page about the Unicode Standard. In the left-hand column, clicking on Versions of the Unicode Standard will get you to http://www.unicode.org/standard/versions/. In the left-hand column you will see the different versions of the standards. Unicode 2.1.9 takes you to http://www.unicode.org/standard/versions/enumeratedversions.html#Unicode_2_1_9, where you will find the major and minor references.
If you look in the book, you'll find conformance is chapter 3. The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, MA, Addison-Wesley Developers Press, 1996. ISBN 0-201-48345-9. Clauses may be amended by: Moore, Lisa. Unicode Technical Report #8, The Unicode Standard, Version 2.1, Revision 2. Cupertino, CA, The Unicode Consortium, 1998. Mark __ http://www.macchiato.com ► “Eppur si muove” ◄

- Original Message - From: Barry Caplan [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, June 05, 2003 10:34 Subject: conformance for unicode 2.x?

I was trying to find the place on unicode.org where conformance for 2.x is defined. I think one of the 2.1.x updates referred back to earlier conformance specs, but I couldn't find them. Any pointers? Thanks! Barry
conformance for unicode 2.x?
I was trying to find the place on unicode.org where conformance for 2.x is defined. I think one of the 2.1.x updates referred back to earlier conformance specs, but I couldn't find them. Any pointers? Thanks! Barry
Re: conformance for unicode 2.x?
Thanks Mark, but I had done all that online searching before I posted to the list. Is the book (which I no longer have a copy of) the only place where the details of conformance for 2.x are archived? If so, is that a good idea? Barry

At 11:09 AM 6/5/2003 -0700, Mark Davis wrote: If you start on http://www.unicode.org/ and click on Start Here, you'll get to a page about the Unicode Standard. In the left-hand column, clicking on Versions of the Unicode Standard will get you to http://www.unicode.org/standard/versions/. In the left-hand column you will see the different versions of the standards. Unicode 2.1.9 takes you to http://www.unicode.org/standard/versions/enumeratedversions.html#Unicode_2_1_9, where you will find the major and minor references. If you look in the book, you'll find conformance is chapter 3. The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, MA, Addison-Wesley Developers Press, 1996. ISBN 0-201-48345-9. Clauses may be amended by: Moore, Lisa. Unicode Technical Report #8, The Unicode Standard, Version 2.1, Revision 2. Cupertino, CA, The Unicode Consortium, 1998. Mark __ http://www.macchiato.com ► “Eppur si muove” ◄

- Original Message - From: Barry Caplan [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, June 05, 2003 10:34 Subject: conformance for unicode 2.x?

I was trying to find the place on unicode.org where conformance for 2.x is defined. I think one of the 2.1.x updates referred back to earlier conformance specs, but I couldn't find them. Any pointers? Thanks! Barry
urban legends just won't go away!
http://archive.devx.com/free/tips/tipview.asp?content_id=4151 Who knew in this day and age flipping bits to change case is still publishable (this is from today!) Barry Caplan www.i18n.com Vendor Showcase: http://Showcase.i18n.com

-- Use Logical Bit Operations to Change Character Case This is a simple example demonstrating my own personal method.

// to lower case
public char lower(int c) {
    return (char)((c >= 65 && c <= 90) ? c |= 0x20 : c);
}

// to upper case
public char upper(int c) {
    return (char)((c >= 97 && c <= 122) ? c ^= 0x20 : c);
}

/* If I wanted, I could create a method for converting an entire string to lower, like this: */
public String getLowerString(String s) {
    char[] c = s.toCharArray();
    char[] cres = new char[s.length()];
    for (int i = 0; i < c.length; ++i)
        cres[i] = lower(c[i]);
    return String.valueOf(cres);
}

/* or even converting to capital: */
public String capital(String s) {
    return String.valueOf(upper(s.toCharArray()[0])).concat(s.substring(1));
}

/* using it */
public static void main(String args[]) {
    x xx = new x();
    System.out.println(xx.getLowerString("LOWER: " + "FRAME"));
    System.out.println(xx.upper('f'));
    System.out.println(xx.capital("randomaccessfile"));
}
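What makes the tip an urban legend is that the bit trick only covers the ASCII letters. A minimal sketch of the failure mode, contrasted with the locale-aware conversions in the standard library (the class and helper name here are my own, for illustration):

```java
import java.util.Locale;

public class CaseDemo {
    // The tip's trick: flip bit 0x20, guarded so it only touches A-Z.
    static char bitLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
    }

    public static void main(String[] args) {
        System.out.println(bitLower('F'));       // 'f' -- fine for ASCII
        System.out.println(bitLower('\u00C9'));  // 'É' passes through unchanged
        // Library conversions cover the whole repertoire, including
        // locale-specific rules such as the Turkish dotless i:
        System.out.println("\u00C9".toLowerCase(Locale.ROOT));              // "é"
        System.out.println("I".toLowerCase(Locale.forLanguageTag("tr")));   // "ı" (U+0131)
    }
}
```

Even the guarded version silently leaves every non-ASCII letter alone, and no bit pattern can express pairs like I/ı that depend on the locale, which is exactly the point being made above.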
Re: Documenting in Tamil Computing
At 10:34 AM 12/17/2002 +0100, Stephane Bortzmeyer wrote: There are various extensions and kluges described in various RFCs (ESMTP, MIME, etc.) All these extensions are referenced in the same RFC, 2821, which is the authoritative one about SMTP. I do not know any mainstream SMTP server which does not implement them. The most important for us is 8BITMIME: Eight-bit message content transmission MAY be requested of the server by a client using extended SMTP facilities, notably the 8BITMIME extension [20]. 8BITMIME SHOULD be supported by SMTP servers.

There is another RFC, whose number I forget, that defines should. Essentially it says you must not rely on anyone else actually implementing this feature.

but they are not universally implemented at the server transport layer, This is absolutely wrong. sendmail, Postfix and qmail have allowed 8-bit transport for a *very* long time.

Well, aside from the fact that those are not the only pieces of mail transport software by a long shot, this feature is a configurable option, and may not always be turned on. But for arbitrary email from one address to another, you can't rely on it.

I have sent Latin-1 (ISO 8859-1) emails for more than ten years (and without using quoted-printable or other similar hacks) to French-speaking people in various parts of the world and I'm still waiting for an actual problem. You're playing with words.

Not really - this is very clearly dealt with in an RFC that defines SHOULD and MUST.

In real life, all SMTP servers support 8-bit mail because all SMTP server authors are aware of the issue (true, it was long and difficult to convince them all but it worked). Any counter-example?

Jungshik Shin wrote: Besides, some email servers still don't abide by the ESMTP standard and don't include '8BITMIME' in their response when queried with 'EHLO' although they support 8bit-clean transport (as you wrote).
I did a quick survey of mail servers in the .com top level domain about 18 months ago to see which servers implemented 8BITMIME and which didn't. IIRC, about 20% or more did not. As I said earlier, that does not mean 8 bits wouldn't go through anyway if they are modern servers, but you can't rely on that. I would like to do a wider survey if someone could donate some bandwidth, or maybe someone at the W3C who was going to look into this at the time can bring this back to the top of the things-to-do list (no names, but I am pretty sure he is on this list... :) Barry Caplan www.i18n.com
Re: Documenting in Tamil Computing
At 08:32 PM 12/15/2002 -0500, Jungshik Shin wrote: because Unicode is not mature enough to be used in multilingual email yet. You just have to make do with the 8bit TSCII encoding for Tamil eMail. I don't understand what you meant by Unicode not being mature enough to support multilingual emails. Modern email clients like Netscape7/Mozilla, MS Outlook (Express), and Mutt support UTF-8 very well.

Actually, it is not Unicode which is not mature enough. It is SMTP, the core mail transport protocol. It is not 8-bit clean. It is very clear in the RFCs that only 7-bit data is allowed over the wire. There are various extensions and kluges described in various RFCs (ESMTP, MIME, etc.) but they are not universally implemented at the server transport layer, let alone at the client layer. So Unicode falls into a (very large) class of encodings that are not safe to pass over SMTP because they use 8 bits for the encoding of at least some characters. This is a well-known problem, and some mail servers do not follow the SMTP RFC exactly in that they do not specifically strip the 8th bit of all data and turn it to 0. If you are lucky and all the mail servers on the path between you and your recipient act this way, then 8-bit data will go through. But for arbitrary email from one address to another, you can't rely on it. Barry Caplan www.i18n.com
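The corruption described above is easy to reproduce: a hop that zeroes the 8th bit of every byte turns any Latin-1 or UTF-8 text into garbage. A minimal sketch (the class and helper name are mine, for illustration):

```java
import java.nio.charset.StandardCharsets;

public class StripDemo {
    // Simulate a non-8-bit-clean SMTP hop that zeroes the high bit of each byte.
    static byte[] strip8thBit(byte[] data) {
        byte[] out = data.clone();
        for (int i = 0; i < out.length; i++) {
            out[i] &= 0x7F;
        }
        return out;
    }

    public static void main(String[] args) {
        String sent = "caf\u00E9";  // "café": 'é' is 0xC3 0xA9 in UTF-8, both bytes need the 8th bit
        byte[] wire = strip8thBit(sent.getBytes(StandardCharsets.UTF_8));
        String received = new String(wire, StandardCharsets.UTF_8);
        System.out.println(received);  // "cafC)" -- 0xC3 0xA9 masked down to 0x43 0x29
    }
}
```

This is the problem MIME transfer encodings like quoted-printable and base64 solve: they fold 8-bit content into the 7-bit subset before it reaches a hop like this.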
RE: UTF-Morse
At 02:37 PM 11/22/2002 +0100, Marco Cimarosti wrote: Otto Stolz wrote: Marco, you shall be called Marcone, or even (granting a Pluralis majestatis): Marconi ;-) And each element shall be called a Morsel Barry
Re: Patent on æ ø å
I met these guys at a trade show a couple of years ago and, without knowing about this claim to fame, ended up discussing internationalized URLs. IIRC they mentioned something about a patent. I just assume that whatever working groups are standardizing international DNS are working around it. Barry Caplan www.i18n.com

At 08:24 PM 11/22/2002 +0000, Michael Everson wrote: Can there possibly be any truth in any of this? The following is an article in the Danish paper Information: http://www.information.dk/Indgang/VisArtikel.dna?pArtNo=136309 Do you know anything about this? It is supposedly the company Walid (http://www.walid.com/) that has patented the transformation of non-a-z for use in URLs. An article in ComputerWorld (admittedly a year and a half old) - http://www.computerworld.com/managementtopics/ebusiness/story/0,10801,59043,00.html - has some references, among other things to the text of the patent. The Danish site Softwarepatenter.dk has it also: http://www.softwarepatenter.dk/walid.html. It is quite new there. Is this whole thing just a hoax? -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Speaking of Plane 1 characters...
At 05:47 PM 11/11/2002 -0500, John Cowan wrote: Michael Everson scripsit: The scale in question is analogous to a temperature scale, not a reptilian one. Now I very *seriously* don't get it. A temperature scale enumerates the degrees -273, -272, -271, ..., 0, 1, 2, ... in order. When you ask What is the temperature?, you are actually asking What is the scalar value of the temperature? The Unicode scale enumerates the characters 0, 1, 2, ... up through 10FFFF (hex). Unicode scalar values are points on this scale, just as temperature scalar values are points on the (Celsius) temperature scale.

Well, not exactly... temperature is an arbitrary but standard measure of a continuous physical property. The multiple well-known scales attest to that. But code points are absolute points, not continuous. And the fact that one character has a greater encoding value does not make it greater than another in any useful sense. Basically, we are talking about continuous ordinal scales vs. discrete cardinal scales. Hardly analogous at all IMM. Barry Caplan www.i18n.com
Re: A .notdef glyph
At 12:51 PM 11/7/2002 -0700, John Hudson wrote: Inspired by this, I have made a new .notdef glyph: http://www.tiro.com/transfer/notdef.gif Can you provide a document which provides this in context and with the traditional rectangle? Maybe at various point sizes? Looks a lot like pop art to me but I wouldn't head off the discussion yet Barry Caplan www.i18n.com
Re: Character identities
At 04:39 PM 10/28/2002 -0600, David Starner wrote: But think of the utility if Unicode added a COMBINING SNOWCAP and COMBINING FIRECAP! But should we combine the SNOWCAP with the ICECAP? (-: Unicode captures the ice-age during the global warming era! Do we have codepoints for images found on the walls of caves? :) Barry www.i18n.com
Re: The character @ and gender studies...
Yes - imagine the burden on open relay mailers when they try to blast spam to ill-formed email addresses they harvested! Hey wait - maybe this is a *good* idea! Barry www.i18n.com

At 02:12 PM 10/25/2002 +0100, Michael Everson wrote: At 05:31 -0700 2002-10-25, Ramiro Espinoza wrote: In some Latin countries the people involved in gender studies are using the character @ to mean a/o. Example: Tod@s nosotr@s (instead of todos nosotros -All of us-). They try to give a male and female approach to the Spanish generic words. That's pretty horrible. Why don't they just use the letter schwa? :-) -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Origin of the term i18n
At 10:25 PM 10/14/2002 -0700, you wrote: Hmmph. It was a mildly interesting question at first, and it wouldn't have been too bad to see six or eight responses, but by my count we are up to 52 messages in this thread. (53, counting this one.) The participants have either fallen into a religious debate over which group or individual first came up with the idea -- as if that could ever be proved conclusively -- or have started a fad of coining silly new

I don't see it as a religious debate, or even a debate at all - after all, the conclusion was for all intents and purposes on my web site already. What is more interesting to me is an exploration of the history of internationalization now that we have more or less settled when i18n was coined. The history goes through a period of hand-wringing about what to even call what we now know as internationalization and localization. It wasn't always so clear cut - I made some calls to people I know who aren't in this community anymore but who were long ago, and who might provide some insight. I have an article, written at my request by the source from my own article last week, covering some of the history - further back than we have covered in this thread. I intend to post it ASAP on i18n.com, except I had a server crash over the weekend. Hopefully that will be fixed in the morning and I can get the article to you. There is an interesting twist in the story about why, at that time and place, internationalization itself was not sufficient as Mark suggested, and it is persuasive to me. Then I intend to raise the question, for those who were around longer than me, of just how far back the idea of internationalization actually goes and when that term was first used. To me, the two holy grails of computer science from day one have been good chess-playing programs and machine translation. So at least back into the mid-1950s there was a need for multilingual computing of some type.
I am sure there were a lot of roll-your-own techniques for a good long time. When did these techniques get a name at all, and what was the name and definition? Was it something other than internationalization? If so, how did it morph into what we know now? When did localization come into it? These are important historical questions and I think wholly appropriate for this list.

You won't see *this* happen every day, but I'm in almost total agreement with Mark Davis. Some of these number-based abbreviations may be useful at times, but for the most part they're like emoticons -- overuse them, or cross the line inventing new ones, and they immediately become trite and cutesy.

One of the signs of a mature specialty is a set of jargon and a set of inside humor. To me, l10n and i18n are the only ones we should use every day. I respectfully disagree about g11n. The rest may be overdoing it a bit, but I see the point if they express a concept of i18n/l10n as applied to a specific region or locale beyond the word spelled out itself. That is the power of jargon and branding both.

It has nil to do with Unicode.

My research over the last week indicates that the origins of Unicode are very definitely of the same era and from the same community as the people who brought the idea of internationalization to a critical mass, and coined the term i18n. One has not been separable from the other since at least 1989.

I can do all that, if it would help kill this thread.

Personally I would love to see it all end up being moved to i18n.com. There has been a fair amount of off-list discussion going on, btw. Barry Caplan www.i18n.com
Re: Origin of the term i18n
At 12:37 AM 10/15/2002 -0700, Doug Ewell wrote: Barry Caplan bcaplan at i18n dot com wrote: What I am arguing against is going hog-wild making up new obscure abbreviations from the same template, and clogging the Unicode list with them. Anything beyond i18n and l10n is tantamount to the man with glasses smoking a cigar and drooling type of smiley.

Well, some were used in jest by correspondents who often engage in wordplay on list and off list, truth be told. But I pointed out that the scheme is a meme picking up steam, and not just in software. I didn't make up a12n, even though I hadn't seen it used before. I also didn't make up c17g or m17n. I provided evidence of my claims that this is spreading by pointers to the sites. The only reason I did that is because someone (Mark I think, but I could be wrong) objected to the entire abbreviation scheme. The point is, it is not going away and it will probably be used more and more in different types of places. It occurred to me the other day - I haven't had a chance to check this and maybe someone else will - that all 4-character domain names under the .com domain may well be registered, which means there may be a lot more sites of the form xdx.com or xddx.com. Barry Caplan www.i18n.com
add a12n to the list...
http://lists.kabissa.org/mailman/listinfo/a12n-collaboration Wasn't there a Red Hot Chili Peppers song called c13n? Barry
Re: Origin of the term i18n
At 11:11 AM 10/11/2002 -0700, Mark Davis wrote: Sorry to appear the curmudgeon, but I've never seen any but a relatively few people use this goofy form of abbreviation, and then for only a few of the words on your web page. A search for normalization and Unicode yields 32,800 entries on Google. A search for n11n yields 3.

I have seen m17n come out of Japan, and I saw the same abbreviation scheme applied in a totally unrelated context at http://www.christadelphian.org/MEMBERS/index.htm: Welcome to the inside of C17g. that's Christadelphian.org shortened - there are 17 characters between the C and the g of the name... it saves a lot of typing

Not a trend.

Not a trend but a meme. Mark, I am curious why you find this term so distasteful? Is it the algorithm itself, or just a general objection to acronyms and the like? Or something else entirely? Barry Caplan www.i18n.com
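The scheme under discussion (first letter, count of elided letters, last letter) is trivial to mechanize. A small sketch, with class and method names of my own choosing:

```java
public class Numeronym {
    // Abbreviate a word as first letter + count of interior letters + last letter,
    // e.g. "internationalization" (20 letters) -> "i18n".
    static String numeronym(String word) {
        if (word.length() <= 3) return word;  // nothing worth eliding
        return word.charAt(0)
                + Integer.toString(word.length() - 2)
                + word.charAt(word.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(numeronym("internationalization")); // i18n
        System.out.println(numeronym("localization"));         // l10n
        System.out.println(numeronym("normalization"));        // n11n
    }
}
```

The same three lines reproduce the n11n that Mark searched for, which is part of why the coinages multiply so easily: any long word yields one.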
Re: [nelocsig] Re: Origin of the term i18n
At 02:49 PM 10/11/2002 -0400, Tex Texin wrote: According to XenCraft, if the software industry were to exert its ability to influence the English language thru its control of message catalogs used in software thruout the world, numeronyms (n7ms) could replace words completely by the year 2016 (this is the year, not a numeronym).

The research analysts at i18n.com differ in their analysis. They assure me that the i18n.com developers can write an Apache module that would convert pages encoded with characters from the traditional single-byte encodings, such as the ISO-8859 series, to the new format in approximately 15 minutes. Any site that is on a server running Apache with mod_perl would then be automatically available in this format with no further intervention by the site's authors or owners. Planned follow-on projects include forming a committee to define precisely how the algorithm should apply to languages with more complex writing systems, creating a proxy server that browsers can use to convert pages from non-Apache servers, and adding support for various wireless browsers. Once proper funding is secured for the crack i18n.com development team, the conversion (c8n) and obsoletion (o8n) could literally be available overnight. Barry Caplan www.i18n.com
Re: Origin of the term i18n
At 12:20 PM 10/11/2002 -0700, Mark Davis wrote: Mark, I am curious why you find this term so distasteful? Is it the algorithm itself or just a general objection to acronyms and the like? Or something else entirely? I find this particular way of forming abbreviations particularly ugly and obscure.

I think it is a meme that is catching on, and it serves various purposes more important than saving keystrokes:

- These are important words that describe entire fields of study in many specialties.
- Many of them (internationalization and globalization, e.g.) are in the common vernacular, with vague denotations and possibly negative connotations in the general public.
- As such, the words are seriously overloaded and confusing.
- Not only that, but they are spelled differently in various parts of the English-speaking world, which affects indexing.
- They are long and hard to spell for non-native speakers (and probably most US native speakers too).
- They are tongue twisters for all, especially for some non-native English speakers.
- The overloading of definitions, even within scholarly fields, is calling out for a separation and branding (do a search on localization and see how many branches of science you get).
- Long words really suck for design purposes. You would be limited to about 9-point type on your business card if anything other than your title included Internationalization.

I am working on digging up some deeper history that might shed more light on how i18n was coined initially, so stay tuned. As for Apple using internationalization internally by 1985, that would be consistent with other evidence of the age of that term wrt (oops, with respect to) computer software. But let's not hold Apple up as a corporate bastion of clear terms. The entire public-facing corporate branding strategy since the 1984 release of the Mac has been to *not* use functional terms for products. This is just now beginning to change with iPhoto, etc.
The strategy has always been anti-Microsoft in this regard, and Microsoft has always preferred generic terms wherever possible. So if Apple still does not use i18n in its docs, then it is business as usual wrt contrariness to Microsoft's approach, but *not* business as usual wrt the rest of Apple's history. This is an interesting place for Apple to be (no pun intended). Barry Caplan www.i18n.com

PS - I just checked on developer.apple.com - it is indeed devoid of references to i18n save a couple of Java APIs, and totally devoid of l10n. This must be a long-term enforced policy as Mark hinted - I'd love to speak to whoever came up with it - that it could stick for at least 17 years given the changes at Apple is pretty remarkable in itself!
Re: Historians- what is origin of i18n, l10n, etc.?
There is a link with the story on the front page of www.i18n.com Barry Caplan Publisher, www.i18n.com

At 02:02 AM 10/10/2002 -0400, Tex Texin wrote: I was asked about the origin of these acronyms. Does anyone know who created these or where they were first used? tex -- - Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCraft http://www.XenCraft.com Making e-Business Work Around the World -
Re: Historians- what is origin of i18n, l10n, etc.?
At 08:35 AM 10/10/2002 -0700, Rick wrote: The earliest reference I can find to i18n in my old e-mail trail is the following e-mail to the sun!unicode mail list by Glenn Wright. This was Oct 5, 1989. By that time, the term was definitely current, as Mr. Hiura suggests.

I registered i18n.com around 94 or so, and the fellow, whose name I am trying hard to recall (first name JR, Australian or British IIRC, red hair), seemed to indicate the coinage was quite some time before that, and he was very surprised when I told him how extensive the usage was by then. I'm a johnny-come-lately when it comes to unix and other standards history... is there a searchable archive of windows standards anywhere? How about a cvs server of code? It seems to me that i18n or variants could have made it into code as a function name almost immediately, or possibly even before being put into a standards doc. It seems to me that l10n was extant by the time I came to CA ~ 1992. Perhaps Ken Lunde can shed some light - he surely came across a lot of early docs while writing his first book, which I think was a republication of an online archive he maintained. Barry
RE: Historians- what is origin of i18n, l10n, etc.?
How did you find these? I searched on i18n and sorted by date and could not go past the 1000th or so record Barry At 09:52 PM 10/10/2002 +0300, Tor Lillqvist wrote: Well, the first occurence of i18n in Google's USENET archive seems to be http://groups.google.com/groups?selm=5570339%40hpfcdc.HP.COM from Nov 30, 1989. l10n occurs first in http://groups.google.com/groups?selm=1990Aug30.115608.3729%40tsa.co.uk from Aug 30, 1990. --tml
RE: Historians- what is origin of i18n, l10n, etc.?
At 06:35 PM 10/10/2002 +0200, Marco Cimarosti wrote: Radovan Garabik wrote: Google is your friend :-) i18n is first mentioned in USENET on 30 nov 1989,

Here is a mention from 1989-12-02 11:24:11 PST, only 3 days later: http://groups.google.com/groups?q=i18n+1988&hl=en&lr=&ie=UTF-8&selm=454%40longway.TIC.COM&rnum=7 that says:

5. Messaging The UniForum internationalization (I18N) folks brought forward a proposal for a messaging facility to be included in P1003.1b. The working group decided that it needs some more work but will go into the next draft. [Editor's note -- The problem being solved here is that internationalized applications store all user-visible strings in external files, so that vendors and users can change the language of an application without recompiling it. The UniForum I18N group is proposing a standard format for those files.]

This indicates to me that UniForum might be a place to look for earlier references. This is a very interesting thread from 1990: http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&threadm=1990Aug30.115608.3729%40tsa.co.uk&rnum=20&prev=/groups%3Fq%3Di18n%2B1988%26start%3D10%26hl%3Den%26lr%3D%26ie%3DUTF-8%26selm%3D1990Aug30.115608.3729%2540tsa.co.uk%26rnum%3D20
Re: Historians- what is origin of i18n, l10n, etc.?
At 07:34 PM 10/10/2002 -0400, Tex Texin wrote: Mark Davis wrote: We used the term internationalization in Apple in late 85. We might have also used it earlier than that, I don't remember. W0e n3r u2d t1e g1d-a3l, g3y a1d o5e a10n i18n, h5r! Mark, Given the center of work in the i18n and l10n area that has emerged in Ireland (and other places) are you more partial to internationali1ation and locali1ation? :) Barry www.i18n.com
Re: Historians- what is origin of i18n, l10n, etc.?
At 07:34 PM 10/10/2002 -0400, Tex Texin wrote: Mark, that's good to know. I never worked with Apple and so have no Apple doc in my collection.

However, the W0e below is a violation of the encoding and is a security risk. I think the algorithm calls for the shortest string, so people can't sneak in extra nulls - W0e, W00e, etc. That last one would be W0(2)e. The first is optionally W0(1)e. The (deprecated) part of the pattern was designed by the same folks who add ~20% bandwidth (I forget the exact number) to all MIME email in order to get it through 7-bit SMTP. Barry
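The figure forgotten above can be computed: base64, the usual MIME armor for 8-bit bodies, emits 4 output characters for every 3 input bytes, so the floor is about 33% (more once line breaks and headers are counted). A quick check with the JDK encoder (the class name is mine):

```java
import java.util.Arrays;
import java.util.Base64;

public class MimeOverhead {
    public static void main(String[] args) {
        byte[] body = new byte[3000];       // stand-in for an 8-bit message body
        Arrays.fill(body, (byte) 0xE9);     // 'é' in Latin-1
        String b64 = Base64.getEncoder().encodeToString(body);
        // 3 bytes -> 4 chars: 3000 bytes become 4000 chars, a 33% inflation
        System.out.println(b64.length());   // 4000
        System.out.printf("overhead: %.1f%%%n",
                (b64.length() / (double) body.length - 1) * 100);
    }
}
```

Quoted-printable does better on mostly-ASCII text (3 bytes per escaped octet, so the overhead scales with how much of the body is 8-bit), which is why mailers prefer it for Latin-1 prose.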
What good is our jargon? was: Re: Historians- what is origin of i18n, l10n, etc.?
This is a fair question. Why is jargon useful? It serves to define a group and a concept. The best jargon is memorable, short in name, easy to write, catchy in sound to the ear, and universally able to be written. It helps a lot if the term is not already claimed by another group. i18n and l10n both meet all of these criteria, as do LAN and Yahoo! and Google. In this respect, jargon can become a brand.

What is really interesting to me is that the criteria we have as common lore about *why* abbreviations were needed (too long to write and type, and too much of a tongue twister) apparently never occurred to other professions that also use internationalization and localization as key terms. I think it is the ability to separate what we mean from what others mean that is an important value of the jargon. Especially since it is not always clear in context which is which, and also especially since globalization has extremely negative connotations in the popular collective mind. Barry Caplan www.i18n.com

At 05:12 PM 10/10/2002 -0700, Kenneth Whistler wrote: W0e n3r u2d t1e g1d-a3l, g3y a1d o5e a10n i18n, h5r! What I don't understand, since these a10n's are in such widespread use among programmers and character encoders, is why they don't use h9l, as in i12n, lan, and gbn? --K1n BTW, these aan's are not only o5e, they are also o4e, but unfortunately, not o6e in use.
[ot]Re: unsuscribe
I think I might put it on the list of things to do to patch all open source list management software so you have to triple opt-in: in addition to the usual, you have to confirm you read a message that contains nothing but unsubscribe instructions. Anyone wanna help? :) Barry At 06:51 AM 10/4/2002 -0700, you wrote: Allow us to help you once more: http://unicode.org/unicode/consortium/distlist.html#3 It contains the info on how to unsubscribe, and if you scroll down a bit it gives information on what to do if you have problems unsubscribing. MichKa
RE: The Currency Symbol of China
At 12:50 AM 10/1/2002 -0700, Ben Monroe wrote: For instance, IIRC, Isabella Bird wrote in her (British) English travelogue in the early Meiji restoration era (1878 AD) of travels to Yedo (now commonly called Edo in the literature, and known by its modern name to all as Tokyo). She called Tokyo Tokiyo. Just a small correction. The Meiji Restoration was in 1867 (some historians view it as 1868 though).

That's a timezone issue, right? :) Actually, the 1878 date I referred to is the date of the travels discussed in the book, not the date of the Meiji Restoration. The book itself, according to my copy from about 100 years later, was first published in 1880. Barry Caplan www.i18n.com
Re: The Currency Symbol of China
At 10:08 PM 9/30/2002 +0200, you wrote: Yen is an ancient on pronunciation for U+5186; today it's pronounced en. Stefan Really? I have no sources either way, but I always assumed yen was a Western transliteration of en, since there is no ye entry in the kana table. Barry Caplan www.i18n.com
RE: The Currency Symbol of China
Wow! I brought Ben out of lurk status after 6 months! Interesting post too - my limited understanding goes back only to the Heian era (~970-1100 AD OTTOMH). That, combined with various early Western transliterations into what we now call romaji, before Hepburn became semi-standardized. For instance, IIRC, Isabella Bird wrote in her (British) English travelogue in the early Meiji restoration era (1878 AD) of travels to Yedo (now commonly called Edo in the literature, and known by its modern name to all as Tokyo). She called Tokyo Tokiyo. It is these types of early Western writings from Japan where I have seen the Ye used, but since they are also littered with plenty of other examples of weird transliterations, I just wrote it off to that. I also think (but I could be wrong) that ye is not one of the characters in the famous Buddhist poem that uses each of the kana once and only once, and establishes a de facto sorting order by virtue of being the only such poem. OTOH, I am pretty sure that poem either dates from or post-dates the Heian era, so it wouldn't rule out your point. Barry Caplan www.i18n.com

At 03:16 PM 9/30/2002 -0700, you wrote: Barry Caplan wrote: To: Stefan Persson; [EMAIL PROTECTED] At 10:08 PM 9/30/2002 +0200, you wrote: Yen is an ancient on pronunciation for U+5186; today it's pronounced en. Really? I have no sources either way, but I always assumed yen was a Western transliteration of en, since there is no ye entry in the kana table.

Modern Japanese has 5 basic vowels, /a, i, u, e, o/. Old Japanese most likely had 8 vowels, /a, i1, i2, u, e1, e2, o1, o2/. These can further be traced to a proto-Japanese 4-vowel system /a, i, u, o/. In the y-line, there is currently /ya, yu, yo/. During the Nara period, where the first extant literature appears, there is evidence that the man'yougana (precursor to modern kana; Chinese characters) regularly distinguished between two types of /e/ (called Kou/Otu or A/B sounds, among others).
This is usually taken by most scholars as /e/ and /ye/. By the early Heian period, with the emergence of the kana syllabary, this Kou/Otu distinction vanished, specifically the /e/ and /ye/ distinction by around 938 AD. It is usually assumed that the /e/ and /ye/ (which is written with /e/) merged into [ye] (or [je], if you like). Notice that the Portuguese dictionary of 1603 spells this /e/ as ye. Other documents indicate that this /e/ [ye] must have become [e] (as modern) by 1775 or earlier. Also note that some dialects in Kyushu still retain the [ye] pronunciation for /e/. I do not really have the time to go into more details right now. I hope this will suffice. Ben Monroe
Re: Keys. (derives from Re: Sequences of combining characters.)
U+003C in a way that makes using U+003C with the meaning LESS-THAN SIGN in body text intermixed with markup sections awkward. That feature of XML may not matter for situations involving simply encoding literary works, yet for a comprehensive system which can include the U+003C character with the meaning LESS-THAN SIGN in body text and in markup parameters, it does not suit my need. You may be under the mistaken impression that any but the tiniest amount of raw XML is ever edited by hand. If you think your message creators are going to create your files, XML or Comet Circumflex in the actual markup language, well, that just won't happen often in practice. A UI which handles, well, the User Interface, will be needed, making the choice of markup language moot until it comes to what other systems can accept. It is not a fact that my proposed markup convention, as you call it, is not a good idea. It may be your opinion and it might perhaps be the opinion of some other people. Yet my proposed markup convention, as you call it, is entirely within the rules, for keys generally, as in my original post, and for my comet circumflex key in particular. No one is saying it is not valid Unicode. From a market acceptance point of view, you have seen a consensus that there are a lot of reasons why it probably is not a good idea, coming from people I know to have an enormous amount of experience in these specific matters upon which to draw these conclusions. I for one would be interested if you could come up with some others whose opinion supports your own, although perhaps off list is the place for that. Why should the discussion be taken elsewhere? It is about the application of Unicode to markup and of one particular application to language translation in a manner where Unicode could be widely used, as the comet circumflex system could be used with all of the languages which Unicode supports. Well, the moderator keeps letting it go on...
if not I am willing to carry it ad infinitum on i18n.com - just click on Submit story on any page Actually, I was rather hoping that, with your specific interest in languages, you would have wished to have a try at using the comet circumflex system as one of the features of the comet circumflex system is that it could be used with minority languages as easily as with the major languages of the world. If I may speak for Peter, I think he would be willing to consider it were it XML based. However, I offer the caveat that you may be in for some rude surprises when you find out how hard it is to actually translate beyond the simplest sentences (and sometimes even that) when you parameterize them as you propose. I have been of the opinion for several years as far as localization goes, that it is better just to take out the parameters and list all the possibilities. Of course in the general case that may not be possible, but then you are in the realm of machine translation, which already exists for better or worse. So in your case, you may also need to make a case that your solution is more useful than just listing non-parameterized sentences, yet more likely to provide a useful translation than existing machine translation systems. Based on the example sentences about the weather in (London, Berlin, Tokyo) etc. from your original post, I would say that is a very open question. Barry Caplan www.i18n.com
Re: Keys. (derives from Re: Sequences of combining characters.)
otherwise, I can write a 5 line perl program to run on a spare machine that will create prior art of every possible combination of characters. I can let it run forever and hook it to a web server to make it visible too. An added bonus of using the comet circumflex key is that documents containing comet circumflex codes do not necessarily need to contain any characters from the Latin alphabet. Why is this a bonus, let alone an added one? I have a 4 year old niece just learning the Latin alphabet and as far as I can tell it hasn't changed since I learned it. There is no U+003C character in that alphabet. In fact, the bonus of using 3C as a delimiter (along with other XML delimiters) is that they are in every legacy encoding, meaning if no Unicode tools are available for editing, a regular text editor can be used and the conversion to Unicode can happen later. Your method requires Unicode support and fonts (not the same thing) at the editing stage, which is not realistic unless you want to limit your community to a few of your closest friends so to speak. No one is suggesting such a system can't be built, only that its usefulness would be strongly limited for a lot of very good reasons. As others have noted, I concur that this is not really a Unicode issue per se, but a software design issue. Barry Caplan
Re: Keys. (derives from Re: Sequences of combining characters.)
At 04:26 PM 9/27/2002 +0100, William Overington wrote: I had not heard the description Message catalog previously, so I can search for that too. I have previously searched under telegraphic code and language and translation. An email correspondent drew my attention to the following list of numbered I have not yet found any example oriented to language translation. Key Unix libraries have used message catalogs as part of the API since time immemorial. Hence any Unix application with even a whiff of a chance of being internationalized is likely to have used those functions. I have not yet found any example oriented to carrying on a complete conversation. I would look for the earliest references to machine translation in the 1940s and 50s, up to the work with Eliza at MIT in the 60s. I think there is an enormous project whose name I don't recall right now going on in Texas, perhaps Austin, which is spiritually derived from Eliza and focused on sending whole, previously composed sentences back in conversational style. If you want to find the whole of the literature in this area, I suggest searching Turing Test. A proprietary coding system is a bad idea. Well, it depends what one is trying to do. If one wishes to establish a system whereby proprietary intellectual property rights exist, then a proprietary coding can be a good idea. Various large companies use proprietary coding systems for files used with their software packages. If, however, one is trying to establish an open system, then you might well be right. Or if you want to minimize the amount of reinventing the wheel you do internally. You can easily use a proprietary format outside and XML inside, just as you can use SJIS outside and Unicode for internal processing. Failure to investigate the state of the art (especially where google is so effortless) means this idea is not pushing any envelope. Well, if you have any specific suggestions of what keywords to use in a search, that would be very helpful.
I have given you some. Rather than focusing on pseudo-scientific terms like radiogram, I suggest starting with a familiarity with the history of computer science, both pure and applied research. The keys idea is pushing the envelope. No it is not. As spin off from this discussion, maybe the XML people, and the Unicode Technical Committee, will do something about having special characters for the XML tags rather than using U+003C and thereby help people wanting to place mathematics and software listings in the same file as markup. Is using U+003C a legacy from ASCII days? Why is it not possible to use signs in XML? Most of my postings in this thread are in response to people asking me specific questions and raising interesting points. That is surely why a discussion group exists. But most of the answers you get are based on a shared technical and educational background which you don't have and/or seem to value. It is difficult to describe, but a lot of early computer science research was about how to effectively decompose functionality and data. Sadly, I think a lot of this is being lost. For a more technical starting point, look for the works of Edsger Dijkstra starting in the 1960s. For a less technical point of view, look for The Mythical Man-Month from the mid 70s (recently updated), and its spiritual followups by Ed Yourdon and Tom Demarco. When I read the responses you get, I have the feeling that the authors have internalized the lessons of these important texts (even if they may not have studied them explicitly). Once you internalize the lessons also, then you will have a better understanding of the points of view you are consistently receiving with friction.
I am hoping that I can publish some web pages with some comet circumflex codes and sentences about asking about the weather conditions and temperatures at the message recipient's location together with codes and sentences for making replies so that hopefully people who might be interested in some concept proving experiments can hopefully have a go at some fascinating experiments with this technology. Unicode can be used to encode many languages and it will be interesting to find out what can be achieved using the comet circumflex system. That might be an interesting web site in its own right, but the technology is nothing special and has been done a million times under a million names and ten million times with no name at all. Barry Caplan Publisher, www.i18n.com
Re: glyph selection for Unicode in browsers
At 02:59 PM 9/26/2002 -0400, Tex Texin wrote: Shouldn't that be something more like: pan-script Unicode-based font? or p8e font? :) Barry Caplan www.i18n.com
Re: no replies
Roslyn, I will head off trouble for you because your message is likely to be otherwise ignored or semi-flamed. The best place to get information on compiling and configuring php is on a php support or developer list. There must be information on how to subscribe to such lists on the php home page, which I am guessing is php.org. Another great source to find answers that I use at least 10 times a day with a 90%+ success rate is to search on related keywords on google.com and groups.google.com. OTTOMH, in your case I would try searching php enable-mbstring in those places and see what you find. This list is for questions related to Unicode. That is probably why no one has replied previously. Few if any people here are php developers, and even fewer are going to be versed in the details of configuring and compiling php. Hope this helps! Barry Caplan www.i18n.com At 04:35 AM 9/24/2002 -0700, you wrote: aaah finally, one reply to that question!! thankyou BOB. anyways, could anyone tell me how i can recompile php to include mbstring support. i used the ./configure enable-mbstring option, did the make install..etc etc, but i still can't seem to run any of the mbstring functions in my php code, i get fatal error: call to undefined function mb_(whatever)...could anyone pls assist me here. thanks regards, roslyn
Re: about starting off
Roslyn, I am working on a postgres database too - I haven't yet gotten to extensively testing the unicode aspects, but be sure to set the character set of the database to unicode when you create it. Otherwise all is probably lost - I don't know that you can simply change the char set later, and if you have to dump and import the data, you'd have to do some sort of conversions. Why bother making extra work for yourself? As for the code in php (I am using Perl myself and something similar applies) every time you manipulate text (every time!) get used to asking yourself if you (or php) are making any assumptions that one byte is the same as one character. The answer needs to be no, but will often be yes. Reconciling these issues is the bulk of making Unicode work for you. Barry Caplan Publisher, www.i18n.com On Thu, 19 Sep 2002, roslyn jose wrote: hi, im new to unicode, and am working on a project in php/postgresql. i need some info on how to start off with unicode. i went thro the web site and only saw explanations on what it is, its char set,etc. do i need to download or install anything to work with unicode, pls let me know soon. and also once downloaded do i need to import any classes or files when working with it, as im scripting in php and html. thanx regards, roslyn
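[Editor's note] The byte-versus-character question Barry raises is easy to demonstrate. A minimal sketch in Python (not the poster's PHP/Perl code, just an illustration of the same pitfall):

```python
# Illustration of the assumption Barry warns about: one byte is NOT
# always one character once you leave ASCII.
text = "日本語"                  # three characters of Japanese sample text
utf8_bytes = text.encode("utf-8")

print(len(text))        # 3  -> character count
print(len(utf8_bytes))  # 9  -> byte count: each character is 3 bytes in UTF-8
```

Any code that computes lengths, truncates, or indexes by byte offset will go wrong on the second measure while appearing correct on ASCII test data.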
Re: Why w and y are not vowels? [Was: Re: Latin vowels?]
At 04:37 PM 9/9/2002 -0400, John Cowan wrote: [Da][n] [Ko][ga][i], 5 Japanese Syllables, 3 English Syllables 5 moras, 3 syllables, actually. A new vocabulary word for me, so I looked it up... mo·ra n. pl. mo·rae or mo·ras The minimal unit of metrical time in quantitative verse, equal to the short syllable. How does this apply unless I write something like? I think that I shall never see a Kogai lovely as a tree Mora sounds like jargon for a more specialized situation, unless I am missing something ... Barry Caplan http://www.i18n.com
Forwarded question....
Hi Unicoders... I received this question and I didn't have a good answer ...perhaps someone else here can help? I have a Japanese text file in Shift JIS and I need to convert it to escaped Unicode. Does anyone know of any tools or utilities that can do this? The standard character encoding sets available in text editing tools like Hidemaru don't appear to do this. Any suggestions would be helpful. Thank you. By escaped Unicode, she means \u format. Barry Caplan http://www.i18n.com
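[Editor's note] The conversion asked about here is mechanical: decode the Shift JIS bytes, then emit each non-ASCII character in \uXXXX form. A hypothetical sketch in Python (modern tooling, not anything available in Hidemaru at the time; the function name is made up):

```python
def sjis_to_escaped(sjis_bytes: bytes) -> str:
    """Decode Shift JIS bytes and escape non-ASCII characters as \\uXXXX."""
    text = sjis_bytes.decode("shift_jis")
    return "".join(
        ch if ord(ch) < 0x80 else "\\u%04x" % ord(ch)
        for ch in text
    )

# Example: the yen/en character U+5186 as it appears in a Shift JIS file
sample = "円".encode("shift_jis")
print(sjis_to_escaped(sample))  # \u5186
```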
Re: Revised proposal for Missing character glyph
At 09:49 PM 8/26/2002 -0400, John Cowan wrote: Nowadays, experts can detect mismatched character sets from the nature of the byte barf that appears on their screen. And super-experts can read languages in byte barf as it is not random! Barry Caplan http://www.i18n.com
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
Yes, yes, I think this is an idea which could fly. --Ken Good. It is a solution which could be very useful for people writing programs in Java, Pascal and C and so on which programs take in plain text files and process them for such purposes as producing a desktop publishing package. Uhh, I think Ken's message was entirely sarcasm or some higher form of rhetorical humor whose obscure name slips my mind right now. The suggestion to use html as an extension was the giveaway - I was laughing out loud from that point on - his point was that the technology to do what you want already exists: it is called HTML and it is displayed by browsers and so forth. Barry Caplan www.i18n.com
OT Laugh for the day - I liked the title of this security related article
and the first few sentences as well Barry Caplan www.i18n.com http://www.securitymanagement.com/library/000599.html How to Keep Out Bad Characters By DeQuendre Neeley The business world is one of constant motion. But it is not just people who are on the move. It is also information. Businesses today depend on the efficient exchange of information, for which they rely increasingly on the Internet and other computer networks. Unfortunately, in the digital world, as in its physical counterpart, bad characters will sometimes try to slip in with the good.
Re: The standard disclaimer
At 10:08 PM 7/24/2002 -0700, Doug Ewell wrote: Tex Texin tex at i18nguy dot com wrote: Hall? Check? Re- ? Water? No, too late. John Hudson already won this round, for finding a way to bring it back on topic. (Turns to John and bows, Pat Morita style.) Congratulations, master. And for that we give him high - Barry Caplan www.i18n.com
Re: Unicode certification - was RE: Dublin Conference:
At 08:07 AM 7/25/2002 -0700, David Possin wrote: After that we can add the chocolate sauce, the cherry, and the sprinkles of Unicode. The special Unicode compliance tests are harder to define and to perform, I agree. But in most cases these issues haven't even been implemented yet. But isn't the reason someone would want to quantify compliance precisely to find out what is implemented and what is not? Barry Caplan www.i18n.com
Re: Abstract character?
I usually define an abstract character in talks I give as an element of a writing system that you care about, independent of glyphs, and certainly independent of encodings or specific code points. If it could be described more precisely than that, it wouldn't be abstract, would it? :) This is usually brought up in a series of definitions leading from character (what we are referring to here as abstract character), and then:

- character list - a list of characters one is interested in
- character set - a list of character lists, which may or may not be ordered, but still has no codepoints
- encoding scheme - an algorithm for assigning code points to a character set
- code point - the representation of an abstract character in an encoding scheme
- font - a series of glyphs that are used to display the characters represented by code points, in their immediate context

All of this is filled with examples - building to an explanation of Unicode. For example, wrt abstract character, I ask the audience to ponder if upper case A and lower case a are the same abstract character. Also, I ask them to ponder if lower case a displayed in Helvetica is the same character as lower case a in Times Roman. Finally, how about lower case a in 9 point Helvetica and lower case a in 18 point Helvetica? And apropos a thread from last week, Unicode introduces new concepts such as character properties, which means the anticipation and intrigue I spend time building in the audience that there is a neat solution to the historical morass I just spent 40 minutes describing gets thoroughly dashed! Joy! Implicit in this set of definitions is of course that a character may or may not be of interest to all character lists, and therefore may or may not end up represented in more than one encoding. Also note that even when it does end up in more than one, this model in no way implies a round trip capability.
This leads nicely into a discussion about some very important aspects of internationalizing code and working with 3rd party components. Barry Caplan www.i18n.com

At 01:38 PM 7/22/2002 -0700, Kenneth Whistler wrote: Lars Marius Garshol asked: I'm trying to find out what an abstract character is. I've been looking at chapter 3 of Unicode 3.0, without really achieving enlightenment. The term Unicode scalar value (apparently synonymous with code point) seems clear. It is the identifying number assigned to assigned Unicode characters. Here is one of my attempts at a more rigorous term rectification:

Abstract character - that which is encoded; an element of the repertoire (existing independent of the character encoding standard, and often identifiable in other character encoding standards, as well as the Unicode Standard); the implicit basis of transcodings. Note that while in some sense abstract characters exist a priori by virtue of the nature of the units of various writing systems, their exact nature is only pinned down at the point that an actual encoding is done. They are not always obvious, and many new abstract characters may arise as the result of particular textual processing needs that can be addressed by characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER, etc., etc.)

Code point - A number from 0..10FFFF; a point in the codespace 0..10FFFF.

Encoded character - An *association* of an abstract character with a code point.

Unicode scalar value - A number from 0..D7FF, E000..10FFFF; the domain of the functions which define UTFs. The Unicode scalar value definitionally excludes D800..DFFF, which are only code unit values used in UTF-16, and which are not code points associated with any well-formed UTF code unit sequences.

Assignment (of code points) - Refers to the process of associating abstract characters with code points. Mathematically a code point is assigned to an abstract character and an abstract character is mapped to a code point.
This is distinguished from the vaguer sense of assigned in general parlance as meaning a code point given some designated function by the standard, which would include noncharacters and surrogates. So far, so good. Some questions: - are all assigned Unicode characters also abstract characters? Yes. Or rather: all encoded characters are assigned to abstract characters. (See above for my distinction between assigned and designated, which would apply to noncharacters and surrogate code points -- neither of which classes of code points get assigned to abstract characters.) - it seems that not all abstract characters have code points (since abstract characters can be formed using combining characters). Is that correct? Yes. (Note above -- abstract characters are also a concept which applies to other character encodings besides the Unicode Standard, and not all encoded characters in other character encodings automatically make it into the Unicode Standard
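[Editor's note] Ken's distinction between abstract characters, code points, and encoded characters can be made concrete. A small Python sketch (my illustration, not from the thread):

```python
import unicodedata

# One abstract character, LATIN SMALL LETTER A WITH ACUTE, is associated
# with different numbers in different encoding standards:
a_acute = "\u00e1"
print(a_acute.encode("latin-1"))  # b'\xe1'      -> Latin-1 assigns 0xE1
print(a_acute.encode("utf-8"))    # b'\xc3\xa1'  -> the UTF-8 code unit sequence

# And an abstract character need not have its own single code point:
# a base letter plus a combining acute forms the same abstract character.
combined = "a\u0301"
assert unicodedata.normalize("NFC", combined) == a_acute
```

This is exactly the point that abstract characters "can be formed using combining characters" without themselves being assigned a code point.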
RE: Inappropriate Proposals FAQ
At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote: Unicode is a character set. Period. Well, maybe. But in a much broader sense than the character sets it subsumes in its listings. Each character has numerous properties in Unicode, whereas they generally don't in legacy character sets. Maybe Unicode is more of a shared set of rules that apply to low level data structures surrounding text and its algorithms than a character set. The Unicode consortium very wisely keeps its focus narrow. It provides a mechanism for specifying characters. Not for manipulating them, not for describing them, not for making them twinkle. All true, except for some special cases (BOM, bidi issues and algorithms, vertical variants, etc.). Not saying those shouldn't be in there, just that they are useful only in the use of algorithms that are explicit (bidi) or assumed (upper case/lower case, vertical/horizontal) etc. In many cases, these algorithms are not well known, even amongst the cognoscenti, or generally available in nice libraries. Anyone for an open source Japanese word splitting library? (I know not taking a look at ICU before I press send is going to come back to haunt me on this, but if it is in there, then substitute something that isn't :) Barry Caplan www.i18n.com
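[Editor's note] Barry's point that every Unicode character carries properties, unlike legacy character sets, is visible directly in the Unicode Character Database. A quick illustration using Python's unicodedata module, which exposes the UCD properties discussed here:

```python
import unicodedata

# Every character has a General Category -- something no legacy
# character set ever defined for its code points.
print(unicodedata.category("A"))       # Lu -> Letter, uppercase
print(unicodedata.category("5"))       # Nd -> Number, decimal digit
print(unicodedata.category("\u3001"))  # Po -> the ideographic comma is punctuation

# Properties drive algorithms: case mapping is derived data, not guesswork.
print("ß".upper())  # SS -> one character can uppercase to two
```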
RE: Saying characters out loud (derives from hash, pound,octothor pe?)
At 09:43 AM 7/12/2002 -0400, Suzanne M. Topping wrote: -Original Message- From: David Possin [mailto:[EMAIL PROTECTED]] so now we have a chromatic audio attribute for each character? Don't be ridiculous. Sounds don't have chroma. There will however be a need for tone and accent variation so that proper localization can be executed. ;^P I have been dreaming of the idea of synaesthetic applications for years but haven't come up with a way to do it yet. But sounds absolutely will need chroma, that much I know. And when you say it with feeling, the fonts will literally be perceived as feeling Such an application better not be written for Windows, because the blue screen of death will be felt rather than seen :) Barry Caplan www.i18n.com
Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ
At 05:13 PM 7/12/2002 -0400, Suzanne M. Topping wrote: -Original Message- From: Barry Caplan [mailto:[EMAIL PROTECTED]] At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote: Unicode is a character set. Period. Each character has numerous properties in Unicode, whereas they generally don't in legacy character sets. Each character, or some characters? For all intents and purposes, each character. Chapter 4.5 of my Unicode 3.0 book says The Unicode Character Database on the CDROM defines a General Category for all Unicode characters. So, each character has at least one attribute. One could easily say that each character also has an attribute for isUpperCase of either true or false, and so on. There are no corresponding features in other character sets usually. Maybe Unicode is more of a shared set of rules that apply to low level data structures surrounding text and its algorithms than a character set. Sounds like the start of a philosophical debate. Not really. I have been giving presentations for years, and I have seen many others give similar presentations. A common definition of character set is a list of characters you are interested in, assigned to codepoints. That fits most legacy character sets pretty well, but Unicode is sooo much more than that. If Unicode is described as a set of rules, we'll be in a world of hurt. Yeah, one of the heaviest books I own is Unicode 3.0. I keep it on a low shelf so the book of rules describing Unicode doesn't fall on me for just that reason. This is earthquake country after all :) I choose to look at this stuff as the exceptions that make the rule. I don't really know if it is possible to break down Unicode into more fundamental units if you started over. Its complexity is inherent in the nature of the task. My own interest is more in getting things done with data and algorithms that use the type of material represented by the Unicode standard, more so than the arcania of the standard itself.
So it doesn't bother me so much that there are exceptions - as long as we have the exceptions that everyone agrees on, that is fine by me because it means my data and at least some of my algorithms are likely to be preservable across systems. (On a serious note, these exceptions are exactly what make writing some sort of is and isn't FAQ pretty darned hard. humor Be careful what you ask for :) /humor I can't very well say that Unicode manipulates characters given certain historical/legacy conditions and under duress. Why not? It is true. But what if we took a look at it from a different point of view, that the standard is an agreed-upon set of rules and building blocks for text oriented algorithms? Would people start to publish algorithms that extend on the base data provided so we don't have to reinvent wheels all the time? I'm just brainstorming here, this is all just coming to me now. If I were to stand in front of a college comp sci class, where the future is all ahead of the students, what proportion of time would I want to invest in how much they knew about legacy encodings versus how much I could inspire them to build from and extend what Unicode provides them? Seriously, most of the folks on this list that I know personally, and I include myself in this category, are approaching or past the halfway point in our careers. What would we want the folks who are just starting their careers now to know about Unicode and do with it by the time they reach the end of theirs, long after we have stopped working? For many applications, people are not going to specialize in i18n/l10n issues. They need to know what the appropriate text-based building blocks are, and how they can expand on them while still building whatever they are working on. Unicode at least hints at this with the bidi algorithm. Moving forward, should other algorithms be codified into Unicode, or as separate standards or de facto standards? I am thinking of a Japanese word splitting algorithm.
There are proprietary products that do this today with reasonable but not perfect results. Are they good enough that the rules can be encoded into a standard? If so, then someone would build an open implementation, and then there would always be this building block available for people to use. I am sure everyone on this list can think of their own favorite algorithms of this type, based on the part of Unicode that interests you the most. My point is that the raw information already in Unicode *does* suggest the next level of usage, and the repeated newbie questions that inspired this thread suggest the need for a comprehensive solution at a higher level than a character set provides. Maybe part of this means including or at least facilitating the description of low-level text handling algorithms. If I did, people would be scurrying around trying to figure out how to foment the duress.) The accomplishments of the Unicode
Re: What Unicode Is (was RE: Inappropriate Proposals FAQ)
At 03:54 PM 7/12/2002 -0700, Kenneth Whistler wrote: Suzanne responded: Maybe Unicode is more of a shared set of rules that apply to low level data structures surrounding text and its algorithms than a character set. O.k., so now before asserting or denying that Unicode ... is a shared set of rules, it would be helpful to pin down first what you are referring to. That might make the ensuing debate more fruitful. Actually, it was me, not Suzanne, that called Unicode a shared set of rules. As Ferris Bueller once said, I'll take the heat for this. I was aware of all of the uses of Unicode that you listed. I have no quarrels with any of them. They do point to the fact that the word is overloaded with definitions. Which means that readers have to choose the appropriate one from the context. The context of the statement above is that the Unicode referred to is the Standard, and all associated documentation. Not Unicode the Consortium which manages the Standard. Not Unicode the way of life :) I did intend to throw open a debate about the long term future of Unicode the Standard and by extension Unicode the Consortium. Since Suzanne is writing a What Unicode is and is not FAQ, I think the answer to that is going to be very definitely colored by the answer to the related question What will Unicode become?, e.g. Unicode 6.0, 7.0, 8.0, etc. See my previous msg, subject line: Hmm, this evolved into an editorial when I wasn't looking :) for some thoughts on that subject. Barry Caplan www.i18n.com
Re: Q: Filesystem Encoding
At 08:43 AM 7/10/2002 -0400, Jungshik Shin wrote: In short: should I still stick to ASCII alone in filenames, or are there filesystems where I really don't have to anymore? Thanks in advance. Definitely/unconditionally no for NTFS. As for Linux ext2 (and most other Unix fs'), unless you mix up UTF-8 and legacy encodings (which you wouldn't because you have never used non-ASCII), it's all right to switch to UTF-8 and use non-ASCII chars. But be aware that such filenames may or may not be able to be transferred *across* file systems. Not only that, but, although I haven't tested in detail for a while, I would not be fully comfortable with middleware that is responsible for managing file names across systems either, such as FTP, email attachments, and Samba. Particularly in the case of FTP and email, just because one client works does not mean another one will. Also keep in mind that even if the file name transfers exactly correctly, there is no guarantee, except for ASCII characters, that the system will have fonts to display the file name. Barry Caplan www.i18n.com
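[Editor's note] A quick way to check the local half of these caveats: a Python sketch that creates a non-ASCII filename in a temporary directory (the filename is made up; whether it survives FTP or a mail gateway is exactly the part this cannot test):

```python
import os
import tempfile

# Create a file with a Japanese name on the local filesystem.
# This only proves the *local* fs accepts it; transferring the name
# across systems (FTP, mail, Samba) is a separate gamble.
with tempfile.TemporaryDirectory() as d:
    name = os.path.join(d, "メモ.txt")  # hypothetical filename, "memo"
    with open(name, "w", encoding="utf-8") as f:
        f.write("test")
    print(os.path.exists(name))  # True on a UTF-8-capable filesystem
```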
Re: Saying characters out loud (derives from hash, pound, octothorpe?)
At 11:37 AM 7/5/2002 +0100, Michael Everson wrote: Also, how does one say the U+007E character out loud while reading out the address of a web page? Tilde. Get real, William. FF5E is colloquially known as a wave in Japanese, IIRC, and hence 007E is a small wave or half width wave. Barry Caplan www.i18n.com
Re: Inappropriate Proposals FAQ
At 10:01 AM 7/2/2002 -0400, Suzanne M. Topping wrote: I have a few ideas for fictional proposals to use as examples (my room layout idea, and Mark's 3-D Mr. Potato Head representation), but I could use another one or two if anyone feels creative. The closer to being believable, the better, I suppose. (An alternative would be to use real-life proposals, and state why they were not accepted, but I thought it more politic to keep it fictional...) There was a discussion last year about a symbol to represent pi/2 or pi/4 or something like that. If you want to fictionalize that to some other fraction of a mathematical constant, that might work (e/2 perhaps?) Barry Caplan www.i18n.com
Re: Creative IDN Opportunities
I think it is somehow tied into the whole ICANN political mess. I haven't sorted it out yet but I am interested if anyone else has... Barry Caplan www.i18n.com At 02:13 PM 6/20/2002 -0400, Suzanne M. Topping wrote: Couldn't help but cringe at the last line of this press release. Can anyone give me a quick update on the status of IDN standards work? It's been a while since I checked it out...
Re: Support for Japanese characters
At 12:21 PM 3/8/2002 -0600, Eric Ray wrote: Need help please. Problem: 1. Current library built for unix and supports ASCII characters only. 2. This library must now accept wide characters from Japanese client. You need to double-byte enable the library except for the most trivial uses. Doing so is not trivial. Facts: -- 1. The library does not really evaluate the Japanese characters to make logical decisions. If the data just passes through, that might be relatively trivial. We believe base64 encoding the character array will avoid any bad things happening in the code (such as hitting a null value or other values that could potentially cause problems). Is the (non-Japanese) data already base64 encoded? If so, why? Why create trouble handling that just to avoid checking for null values? Anyway, if you really aren't going to process the Japanese characters in this library except to pass them through, then you need to take the Japanese text, base64 encode it, and then pass it to the library the usual way. Then retrieve it the usual way, base64 decode it, and voila! Of course this may just move your questions to other parts of your program, but you haven't asked about those places. Without knowing what the application is, or what the configuration is other than "unix", it is hard to say more. 2. Cannot rewrite library in time allowed and don't really need to based on Fact item #1. Plus, pressure to get product to market is greater than internationalizing the library. This is probably a guaranteed way to fail in Japan. Japanese users, and your Japanese partners if you have them, have had many years of experience with bad software from the US that claims to work. They will know how to break it quickly. Then you will learn a hard lesson about doing business with the Japanese while not taking heed of the well-known requirement for quality. What I need help with: -- 1.
How do I set up an ASCII based unix machine, test application and test environment to send Japanese characters to the library in question. I see from your web site that the application is likely some sort of encryption device, possibly for email. Having run the Japanese software group at an email company in the past, I can tell you Japanese email is fraught with its own perils under any circumstances. Without knowing what the actual channel is that you want to pass the text through, it is hard to say how you will want to test it. You also have not described the time schedule and why you consider it tight. Is it safe to assume that your plan to counteract any lack of experience and the time schedule is to spend money to hire someone who has both? 2. Do I need to create hex input or binary input to represent Japanese characters. Since I'm using a standard keyboard how do we get Japanese characters into the application? Use the Japanese Input Method Editor supplied with or for the operating system. But that does not guarantee that the data will actually get to the application properly if the application has not been coded to handle it. This is part of internationalizing your code, and now you see why cutting corners during the initial development is coming back to haunt you. 3. What am I not considering here? What gotchas will I come across by not making my library i18nized? The gotchas are going to fall into the categories of "won't work" or "data passes through OK, but the rest of the application doesn't know how to handle it". Off the top of my head, I would watch out for endianness when you base64 encode Japanese multibyte text too. Probably OK, but worth taking a close look at. Unfortunately, I've never done any i18n or l10n work before so I'm really having trouble figuring out where and how to get started. Any advice is appreciated. There is no magic bullet here in general.
If Zixit values the opportunity in Japan, I would suggest you be open to the offers you are sure to get from experienced folks to assist you. If you don't get any, contact me off-list and I will put you in touch with some. Barry Caplan Publisher, www.i18n.com
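The pass-through approach suggested above - base64-encode the Japanese text, hand the resulting pure-ASCII string to the library, and decode on the way out - can be sketched in a few lines. Python stands in here for whatever language the library is written in, and UTF-8 is an assumed choice; Shift-JIS or EUC-JP would work the same way:

```python
import base64

def wrap(text: str, encoding: str = "utf-8") -> str:
    # Encode the text to bytes, then base64 them: the ASCII-only
    # library never sees NULs or high-bit bytes.
    return base64.b64encode(text.encode(encoding)).decode("ascii")

def unwrap(payload: str, encoding: str = "utf-8") -> str:
    # Reverse on the way out: base64-decode, then decode the bytes.
    return base64.b64decode(payload).decode(encoding)

original = "こんにちは世界"       # hypothetical Japanese input
safe = wrap(original)
assert safe.isascii()            # safe to hand to the ASCII-only library
assert unwrap(safe) == original  # round-trips exactly
```

Note that base64 operates on a plain byte stream, so once the character encoding is pinned down there is no separate endianness concern for a byte-oriented encoding like UTF-8; with UTF-16 text the byte order of the input would matter, which is presumably the caveat about endianness raised above.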
Re: [OT beyond any repair] House numbers
At 01:16 PM 3/1/2002 -0500, John Cowan wrote: What about the 100 house numbers per block convention? This does not hold in the older parts of older US cities (New York does not obey it south of 8th St or so), but is quite general in the US as a whole. It holds for the whole of Baltimore and extends, on at least the major arteries, into the suburbs. Some suburbs reset the count from their own city centers, and that may or may not include the main arteries. I am not aware of any exceptions at all in Baltimore city. Note that the main arteries radiate in spokes from the center of downtown. All blocks are numbered from the hub (Baltimore St (east/west) at Charles St (north/south)). Thus all 2800 blocks are roughly equidistant from the center. It is less well known that even numbers are on the left as you head out of town in any direction and odd numbers on the right. Anyone who wants to reach me by snail (extremely snail) mail, can do so at: Cowan 12017-0042 USA Doesn't every address that USPS delivers to have a unique 9-digit zip code, making house numbers a legacy? From the US, couldn't I get a letter to you just by putting 12017-0042 on the envelope? Barry Caplan Publisher, www.i18n.com
Need a quick font? make your own!
This is pretty interesting. Is it art, or is it a toy? Make your own TrueType fonts, created by a genetic algorithm! http://alphabet.tmema.org/ Best Regards, Barry Caplan www.i18n.com - coming soon, preview available now News | Tools | Process for Global Software Team I18N
Re: Off-Topic (Re: This spoofing and security thread)
This was discussed in a book I recently read, called Code (don't recall the author right now). Apparently the Danish (I think) translation has an error, but only one. I guess the proofreader was not familiar with grep :) Barry At 08:23 AM 2/14/2003 -0500, Elliotte Rusty Harold wrote: At 11:59 PM -0500 2/13/02, John Cowan wrote: There is an English translation: The Void, wherein the hero, Anton Voyl, becomes Anton Vowl. There are German and Danish translations too. Do you happen to know if these translations also avoid the letter e? German's especially impressive since I think e makes up about 20% of the letters in typical German. -- Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer | The XML Bible, 2nd Edition (Hungry Minds, 2001) | http://www.ibiblio.org/xml/books/bible2/ | http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/ | Read Cafe au Lait for Java News: http://www.cafeaulait.org/ | Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/
Re: Unicode and Security
At 15:53 -0500 2002-02-07, Elliotte Rusty Harold wrote: For text files, probably not. But for the domain name system the world very well might. Indeed, maybe it should, unless this problem can be dealt with. I suspect it can be dealt with by prohibiting script mixing in domain names (e.g. each component of the name must be entirely Greek or entirely Cyrillic or entirely Latin, etc. Note: something_Cyrillic.something_greek.com is OK.) Does anybody really need mixed Latin and Greek domain names? Not only that, but why limit the alleged security risks to domain names? Why not the part of an email address before the @? The allowed characters for that are specified in a different RFC from the one for domain names, and have nothing at all to do with DNS. And how many variations of numerals are there in Unicode? After all, every place you could use a domain name, you could use the actual IP address too. How many ways might that be spoofed? Barry
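The per-label script restriction quoted above is easy to approximate. The sketch below (the function name is made up) uses the first word of each Unicode character name as a crude script tag; a real implementation would use the Unicode Scripts.txt property instead:

```python
import unicodedata

def scripts(label: str) -> set:
    # Crude script detection: "LATIN SMALL LETTER G" -> "LATIN", etc.
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

assert scripts("google") == {"LATIN"}

spoofed = "g\u03bf\u03bfgle"  # Greek omicron in place of both o's
assert scripts(spoofed) == {"GREEK", "LATIN"}
# A registry enforcing single-script labels would reject "spoofed".
```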
RE: Unicode and Security: Domain Names
I want to review these documents, but since time is short, maybe someone can answer my question... Are the actual domain names as stored in the DB going to be canonical normalized Unicode strings? It seems this would go a long way towards preventing spoofing ... no one would be allowed to register a non-canonical normalized domain name. Then, a resolver would be required to normalize any request string before the actual resolve. So my questions are: 1 - Am I way off base here? If so, why? 2 - If not, is it already addressed in these docs? 3 - If it is not in the docs, and the request makes sense, then I will make the effort to beat the deadline, which is next Monday. Thanks! Barry At 10:37 AM 2/8/2002 -0800, Yves Arrouye wrote: Moreover, the IDN WG documents are in final call, so if you have comments to make on them, now is the time. Visit http://www.i-d-n.net/ and sub-scribe (with a hyphen here so that listar does not interpret my post as a command!) to their mailing list (and read their archives) before doing so. The documents in last call are: 1. Internationalizing Domain Names in Applications (IDNA) http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt 2. Stringprep Profile for Internationalized Host Names http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-07.txt 3. Punycode version 0.3.3 http://www.ietf.org/internet-drafts/draft-ietf-idn-punycode-00.txt 4. Preparation of Internationalized Strings (stringprep) http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-00.txt and the last call will end on Feb 11th 2002, 23h59m GMT-5. There is little time left. YA
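A minimal sketch of the normalize-before-register-and-resolve idea raised above (the nameprep draft ultimately specified NFKC plus additional mapping and prohibition tables, so this function, whose name is made up, is a simplification):

```python
import unicodedata

def canonical_label(label: str) -> str:
    # Map every spelling of a label to one canonical form, so the
    # registry and the resolver compare the same string.
    return unicodedata.normalize("NFKC", label).casefold()

# Composed "é" (U+00E9) and decomposed "e" + combining acute (U+0301)
# are different code point sequences but the same canonical label.
assert canonical_label("caf\u00e9") == canonical_label("cafe\u0301")
```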
Re: Unicode and Security
At 12:22 PM 2/7/2002 -0500, Elliotte Rusty Harold wrote: I've been thinking about security issues in Unicode, and I've come up with one that's quite scary and worse than any I've heard before. It uses only plain text, no fonts involved, doesn't require buggy software, and works over e-mail instead of the Web. All it requires added to the existing infrastructure is internationalized domain names. So in the hope that this becomes a self-defeating prophecy, here's the scenario: [snip] Can you please update me on your budget? Bob, noticing that the e-mail appears to come from Alice, whom he knows and trusts, fires off a reply with his confidential information. Only it doesn't go to Alice. It goes to me. I can then reply to Bob, asking for clarification or more details. I can ask him to attach the latest build of his software. I can carry on a conversation in which Bob believes me to be Alice and spills his guts. This is very, very bad. This is precisely the problem digital signing is meant to solve. Signing means that Alice has encrypted a digest of the message with her private key before sending it to Bob. Bob then checks that digest using Alice's public key. If the signature does not check out, then Bob should not trust that the message is from Alice. This scheme works independent of transport mechanism (email, etc.) or domains. Alice's key stays with Alice, not with the domain. Of course, how you exchange trusted keys in the first place is another matter, but I am sure this is all covered in a security FAQ somewhere. E-mail forgery has been a problem for a long time, but it's always been one-way. You couldn't trick somebody into sending you a reply because doing so required using a different e-mail address than the one they expected, thus revealing the message as forged. There are many, many ways to get a response from someone via email, even if the address is not recognized or is forged. Most involve social engineering approaches more than anything else.
My mailbox filled with spam will attest to that! With a Unicode enabled mailer, that's no longer true. If the fonts Bob (not me, but Bob) chooses for his e-mail program do not make a clear distinction between an o and an omicron, this works. There are lots of other attacks. The Cyrillic and Greek alphabets provide lots of options for replacing single letters in Latin domain names. Unless all messages are signed (technically feasible), there is no trust at all. When Outlook/Exchange supports, in fact requires, messages to be signed, then this problem will start to dwindle away, at least in the email realm. Of course, if there is no method to judge the level of trust for properly signed messages that arrive from folks you don't know (a human fallibility), then knowing the origin of the message might not help much either. My inbound spam can be verifiably signed, but it is still spam. In other words, it's not our fault. Blame the client software. Sounds distressingly like the Unicode Consortium's approach to these issues. Interestingly, my attack works with a single character representation (Unicode). Your attack is only a social engineering attack, not a technical weakness inherent in any protocol or character set (even though there may be such issues). Barry
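The sign-with-private-key / verify-with-public-key flow described above can be illustrated with textbook RSA and deliberately tiny, insecure parameters (a toy sketch only; real systems use a vetted cryptography library):

```python
import hashlib

# Toy RSA parameters -- far too small for real use.
p, q = 61, 53
n = p * q   # public modulus (3233)
e = 17      # public exponent
d = 413     # private exponent: e*d = 7021 = 1 (mod lcm(60, 52) = 780)

def digest(msg: bytes) -> int:
    # Hash the message and reduce it into the RSA modulus.
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

def sign(msg: bytes) -> int:
    return pow(digest(msg), d, n)         # Alice: uses her PRIVATE key

def verify(msg: bytes, sig: int) -> bool:
    return pow(sig, e, n) == digest(msg)  # Bob: uses Alice's PUBLIC key

msg = b"Can you please update me on your budget?"
assert verify(msg, sign(msg))
```

A forged or tampered message fails verification because the forger cannot produce a signature without Alice's private exponent; this is exactly why the spoofed-domain attack above does not survive a signed-mail regime.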
Re: Unicode and Security
At 02:42 PM 2/7/2002 -0500, Elliotte Rusty Harold wrote: At 11:34 AM -0800 2/7/02, Asmus Freytag wrote: But, as the discussion shows, spoofing on the word level (.com for .gov) is alive and well, and supported by any character set whatsoever. For that reason, it seems to promise little gain to try to chase the holy grail of a multilingual character set that somehow avoids the character level spoofing, if the word level spoofing can go on unchecked. Burglary at the broken window level is alive and well. Therefore there's little point to putting locks on doors. I hope the fallacy of the above is obvious, but when translated into the computer security domain it's all too common a rationalization, as this thread demonstrates. It is not obvious to me that there is a fallacy at all, let alone what it is. Instead of stating that we should be able to infer the fallacy, please state it, and a possible solution, explicitly. It seems to me we have already proposed working and available (if not elegant) solutions to the issue of trust of content. Now the issue seems to be trust of domain names. My browser already has built-in support for identifying groups of domains I can assign varying levels of trust to, based on certificate technology. Not elegant, but available. Similarly, something for email could be done using today's technology. More importantly, wrt DNS: under what circumstances can you, today or in the future, actually trust that the address resolving information you get is accurate? None, really. The packets go too many places on the way that could change them. And even if it is accurate, which of course it usually is, how can you be sure that packets at a lower level will actually be delivered as intended, and not misdirected or copied elsewhere? You can't, really, for the same reason. This is the nature of the system, especially at the IP level.
None of this has the slightest bit to do with what characters are used for domain names, and hence it will not go away with any changes to DNS. It has everything to do with why data should be encrypted if you care about security of data. There are many ways to socially engineer someone into doing something they shouldn't do. This is just one of them, and one that's mostly theoretical at the current time. However, we still need to plug the hole. That there are other, less damaging holes (or even more damaging ones) is no excuse for not fixing this one. The source code for bind is available. Go ahead and fix it. Good luck persuading people to upgrade such a mission-critical part of the internet, though. Just to pull a number out of a hat, imagine there are 10,000 attacks a day using spoofing in the current system. Is this any justification for opening up a hole that will add 10,000 more? Of course it's not. I still don't see the attack as anything but social engineering. That a telemarketer or door-to-door salesman can get my credit card info by misrepresenting their intent does not mean there is a flaw in either the phone numbering scheme or the credit card system. Your attack is exactly analogous. Barry
Re: Unicode and Security
At 04:17 AM 2/8/2002 +0330, Roozbeh Pournader wrote: On Thu, 7 Feb 2002, Elliotte Rusty Harold wrote: Trust is a human question decided by human beings, not a boolean answer that comes out of a computer algorithm. I can trust that the message I'm replying to came from a person named Barry Caplan even if I have no proof of that whatsoever. Or that the book you're reading has been written by a person named Nicolas Bourbaki... (Sorry, I love the idea. I could not stop myself.) roozbeh On what basis can Elliotte know that a message purported to be from Barry Caplan actually is from Barry Caplan, or that there even is a Barry Caplan? The person writing this, who claims to be Barry Caplan, has never met anyone named Elliotte Rusty Harold to the best of his recollection. He (Barry Caplan) does claim to personally be acquainted with many others on this list though - hi - sorry I missed you in DC! :) Best Regards, Barry Caplan www.i18n.com - coming soon, preview available now News | Tools | Process for Global Software Team I18N
Re: Unicode and Security
At 11:54 AM 2/6/2002 -0700, John H. Jenkins wrote: The original focus was on digital signatures, and I still don't get the objection. Because I don't know *precisely* what bytes Microsoft Word or Adobe Acrobat use, do I refuse to sign documents they create? Is that the idea? I mean, good heavens, I don't even know *precisely* what bytes Mail.app is going to use for this email. Should I refuse to sign it? I don't think the main issue is whether or not you should sign it. I think the main issue the original poster tried to raise is that, as the recipient of such a signed document, he is not persuaded he should trust it. This is a serious issue, although as several have noted, not a Unicode-only one. No one doubts the security of the encryption algorithms used for signing. But the issue of trust is critical. In the analog world, people are expected to read and understand documents, and in general, the world's legal systems are set up to recognize that a signature (or stamp or seal or whatever) is binding evidence that such care was taken (even if it wasn't really taken). In the digital world, individual behavior and legal processes both may not be so well formed to support the technology of digital signatures. I believe this is what the original point was. IANAL, but enforceability of such a kluged, digitally-signed document seems in doubt. There is a long history of that type of contract support in our US legal systems, and probably others as well. There will surely be difficulties adapting it to the digital domain, but I think the basis for support is already there. Anyway, it is not, but maybe should be, well known that the purpose of digital signatures is to verify who the sender is, and to verify that the document has not been changed in transit. That it might contain tricky language or information is an important thing to note, but the reader still needs to read the document's contents with the same skeptical eye as if it were not signed.
Just as the Unicode bi-di algorithm makes no claims of reversibility, digital signing algorithms make no claim that the signed contents are correct, or even useful.
Re: Unicode and Security
At 02:15 PM 2/3/2002 +0900, you wrote: On Sat, 2 Feb 2002, David Starner wrote: [...several lines cut to save room...] I think I'm missing your perspective. To me, these are minor quirks. Why do you see them as huge problems? I am thinking about electronically signed Unicode text documents that are rendered correctly, or believed to be rendered correctly, yet look different, seem to contain additional text, or seem to be missing some text when viewed with different viewers, due to some ambiguities inherent in the standard. An electronically signed document allows you to trust who wrote it, and that the *byte sequence* hasn't been tampered with. It implies nothing at all, trust-wise, about what software you should use to interpret it. You would go through the trouble to verify a signature, but trust the .doc extension and some machine's implementation of Word with your money? Makes no sense. That being said, identifying security issues of existing programs and/or protocols when they intersect with Unicode-based data is an important topic, and one I intend to cover regularly on www.i18n.com, once it launches this month. For those of you who have specific issues to write about, or are interested in providing a series of security-related articles (length and frequency TBD), please contact me off-list. I think there are endless examples already out there to write about, and I know of at least one that is serious. Let's find more! Best Regards, Barry Caplan www.i18n.com - coming soon, preview available now News | Tools | Process for Global Software Team I18N
Re: VIRUS!!!!! (was Re: new photos from my party!)
Yeah, I wrote about that before going to bed last night, and the photos virus *made it through* on a Yahoo Group I am subscribed to, even though apparently the list is set to *no attachments*. Great. Lucky for me I won't let MS Outlook anywhere near any of my computers. At 05:29 PM 1/28/2002 +, Michael Everson wrote: Now, Sarasvati, what did I say about attachments? -- Michael Everson *** Everson Typography *** http://www.evertype.com Best Regards, Barry Caplan [EMAIL PROTECTED] www.i18n.com - coming soon, preview available now News | Tools | Process for Global Software Team I18N
Re: Variation Selection
At 10:29 PM 1/27/2002 -0500, you wrote: In a message dated 2002-01-27 18:51:35 Pacific Standard Time, [EMAIL PROTECTED] writes: First, have we all servers? No. Assuming we all do is no better than assuming we all have broadband or T1 connections. Yes, we do all have servers: Yahoo is your friend - you can get an unlimited number of 6mb (I think, maybe more) accounts for free. Store images and any other files in briefcase.yahoo.com/yourid. Also, this list is mirrored on a Yahoo group, and the group has storage space too. I don't know who the moderator of that group is, but maybe he/she can assist. In any of these cases, all that needs to be passed to the list is the URL. Frankly, the issue of unexpected attachments in email is not the size for me, but it does cause me security concerns. I would much rather decide whether or not to download a file than wake up one morning with a virus or worse. Best Regards, Barry Caplan [EMAIL PROTECTED] www.i18n.com - coming soon, preview available now News | Tools | Process for Global Software Team I18N
Re: FW: Please help me
If I recall correctly there was a presentation on Uighur and Unicode at the September 2000 conference in San Jose. I think one of the main topics was creating fonts to display the language. Perhaps the talk is archived at the Unicode.org web site? Best, Barry Caplan At 10:46 AM 1/21/2002 -0800, you wrote: -Original Message- From: King of kids [mailto:[EMAIL PROTECTED]] Sent: Saturday, January 19, 2002 1:55 AM To: [EMAIL PROTECTED] Subject: Please help me Dear Sir/Madam, Recently, I have heard that all the Uighur (also called Uyghur, which is more standard in the Uyghur language) language letters are already in the Unicode Standard 3.1. I have seen all the Uyghur letters in: 1. http://www.unicode.org/charts/PDF/U0600.pdf 2. http://www.unicode.org/charts/PDF/U0600.pdf 3. http://www.unicode.org/charts/PDF/UFE70.pdf But I could not find some of them within any font sets of Windows 98/XP/2000. Could you tell me where I can find a font set (e.g., like Lucida Sans Unicode) in which I can find the Unicode Standard 3.1's Uyghur letters? (A font that contains all code points within the Unicode Standard 3.1.) Regards, An Uyghur in Xinjiang Uyghur Autonomous Region, PRC '99 Graduate Student, Computer Department, Xinjiang University Waris Abdukerim * I would like to remind you that some Uighur (Uyghur) letters were not available in the Unicode Standard 3.0, but I found all of them in the Unicode Standard 3.1. Thanks very much.
Re: Devanagari
At 10:44 PM 1/20/2002 -0500, you wrote: Taking the extra links into account the sizes are: English: 10.4 Kb Devanagari: 15.0 Kb Thus the Dev. page is 1.44 times the Eng. page. For sites providing archives of documents/manuscripts (in plain text) in Devanagari, this factor could be as high as approx. 3 using UTF-8 and around 1 using ISCII. Yes, but that is this page only. Are you suggesting that all pages will vary by that factor? Of course not. Please consider whether the space *in practice* is a limiting factor. It seems that folks on the list feel it is not. Not for bandwidth-limited applications, and not for disk-space-limited applications. The amount of space devoted to plain text of any language on a typical web page is microscopic compared to the markup, images, sounds, and other files also associated with the web page. Are you suggesting that UTF-8 ought to have been optimized for Devanagari text? Barry Caplan www.i18n.com -- coming soon...
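The factor of approximately 3 quoted above falls straight out of UTF-8's encoding rules: Devanagari occupies U+0900-U+097F, and every code point in that range takes three bytes in UTF-8 versus one byte per character in the single-byte ISCII. A quick check:

```python
text = "नमस्ते"  # 6 Devanagari code points

# Every character sits in the Devanagari block (U+0900-U+097F)...
assert all(0x0900 <= ord(ch) <= 0x097F for ch in text)

# ...and code points in U+0800-U+FFFF each take 3 bytes in UTF-8.
assert len(text.encode("utf-8")) == 3 * len(text)
```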
Re: The benefit of a symbol for 2 pi
At 10:06 AM 1/18/2002 -0700, Robert Palais wrote: Which seems to make Unicode a defender of the status quo. Inaction is as political as action. We are holders of the standards for the technology for encoding symbols, and we won't admit new symbols until they are widely used... not necessarily the intent, but possibly the impact - that evolution of symbolic communication will be hampered? I think anyone is free to have other competing standards, and there have been other strong ones during the lifecycle of Unicode (ISO 10646, for instance). No one doubts that there are other characters that would be useful to encode. But the original concept of Unicode as a 2-byte encoding leaves 64K code points. Unicode as a group quickly found out that was not enough to make everyone happy. As it is, the standard is rife with kluges in the encoding scheme. The limitation of characters to those that are in current use is related in large part to the code point limitations and partially to the desire to prioritize work. It takes the same amount of work to add a character or group of characters regardless of whether or not those characters will be used. There are plenty of characters which exist in the literature that are not encoded in Unicode, and in fact are specifically excluded: those of written but dead languages. Newly proposed characters at least have a process: get them in use and addition to Unicode will be easy. In your case, one way to go about that may be to build a (probably pretty straightforward) script that searches out instances of 2pi in TeX and Word files, etc., and replaces them with newpi references. Create a font which has this character (maybe where the pi is now, or as a user-defined char?). Make it easy for folks to get and use these tools. Soon there either will or will not be a substantial body of literature using newpi instead of pi, and a large discussion of why and how its adoption in math texts should happen.
Once that is in place, I do not think you will be disappointed by the Unicode group. Right now, to the Unicode folks, newpi seems like a meme that is likely to die. Show otherwise, and life will be easy, as it was for the euro proponents. Best, Barry Caplan www.i18n.com -- coming soon, sign up for features and launch announcements
Re: The benefit of a symbol for 2 pi
At 01:45 PM 1/18/2002 -0500, you wrote: The limitation of characters to those that are in current use is related in large part to the code point limitations What limitations? We have over a million codepoints to play with. There is plenty of room. I've always been under the impression that one of the original goals of the Unicode effort was to do away with the sort of multi-width encodings we are all too familiar with (EUC, JIS, SJIS, etc.). This was to be accomplished by using a fixed-width encoding. In my mind, everything done since then in order to increase space (but not necessarily to save bandwidth) is a kluge, and a compromise, because it means code still has to be aware of the details of the encoding scheme. I do not dispute that with the kluges/compromises, there is plenty of room. There are plenty of characters which exist in the literature that are not encoded in Unicode, and in fact are specifically excluded: those of written but dead languages. They are not only not excluded, they are included: Runic and Deseret are just the beginning. There are many pending proposals for things like hieroglyphs and cuneiform. Only now that there are kluges that allow for extra room, though. And wasn't it the case, historically speaking, that these languages were, shall we say, less than welcome?
Re: The benefit of a symbol for 2 pi
At 11:33 AM 1/16/2002 -0700, Robert Palais wrote: is at the same time somewhat a Catch-22. Nelson Beebe recommended it since he figured Unicode 3.2 would be the make-or-break for getting it in use. I'd be curious if you disagree with the thesis that a symbol for 6.28... has scientific/mathematical merit (in comparison to 3.14...), and if so why? My guess is that since pi is the ratio of the circumference to the diameter, the diameter is a more natural conception of the size of a circle than the radius. Of course mathematically, it doesn't matter other than the factor of 2. But other geometrical shapes, particularly polygons, are measured by line segments that extend from one point to another on the same shape, or series of shapes. A radius just sort of ends in the middle, while a diameter or other chord begins and ends on the circle. I can't quote the history, but if I imagine back to the Greek days, I bet the diameter was the primary measure. Other polygonal shapes with which they were familiar had their measures in terms of a line segment crossing the entire shape and touching the boundaries, or coincident with the boundary. For mathematicians pondering the circle for the first time, there was probably no reason to think otherwise. How to proceed from there to figure the area of a circle or the ratio of the diameter to the circumference were probably some of the greatest challenges of the day. They wanted to know the circumference and area, same as they had calculated for other shapes. I would guess that since pi is the ratio of the circumference and diameter, this problem was solved first. Had it been the other way around, our formulas might look the way Dr. Palais suggests. Now that I think about it, I wonder if the very concept of the radius grew out of the solution to the area of the circle: was the original formula A = pi * (d/2)^2? If so, then maybe a conceptual leap was made to simplify it, thus inventing the radius.
Why simplify the d/2 part and not the other way (pi/4)? Probably because pi is just a number, while d/2 turned out to have some connection to the physical world - the distance from the edge of a circle to the center. But this is just idle lunchtime speculation on my part. Note that using the new symbol the circumference of a circle is simply tri*r, but the area changes from pi*r^2 to (1/2)*tri*r^2, so you lose as much as you gain, it seems to me. Barry Caplan
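The trade-off in that last sentence is easy to check numerically. Modern versions of Python's math module happen to ship the 2*pi constant (under the name tau, one of the later proposals for it), so a short sketch confirms the two formulations agree:

```python
import math

r = 2.0
# Circumference: 2*pi*r with the old constant, tau*r with the new one.
assert math.isclose(2 * math.pi * r, math.tau * r)
# Area: pi*r^2 becomes (1/2)*tau*r^2 -- the factor of 2 moves from the
# circumference formula into the area formula rather than disappearing.
assert math.isclose(math.pi * r**2, 0.5 * math.tau * r**2)
```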
Re: Question
Can you describe the nature of the script and how it uses Unicode (if at all), or what it uses for text processing? What version of Unicode are you using now for your data? Best regards, Barry Caplan At 05:15 PM 1/15/2002 -0800, BBCOA Webmaster wrote: Hello. I am looking for help with Unicode. I was recently told by my credit card processing company that I need to upgrade my site to Unicode 3.2 in order to get a Perl script working. I was wondering how I might be able to do this. I have no idea how to install or find the latest version of Unicode. Gustavo A. Higuera BBCOA Webmaster 818-757-7123 ext 222