[unicode] Mail loop at China.com
Received: from china.com (TCE-E-7-182-12.bta.net.cn [202.106.182.12]) by unicode.org (8.9.3/8.9.3) with SMTP id AAA03398 for [EMAIL PROTECTED]; Fri, 23 Mar 2001 00:11:40 -0500 Received: from china.com([10.1.7.101]) by china.com(JetMail 2.5.3.0) with SMTP id jm93abafcca; Fri, 23 Mar 2001 06:06:05 - Received: from unicode.org([209.235.17.55]) by china.com(JetMail 2.5.3.0) with SMTP id jm533aba555b; Thu, 22 Mar 2001 14:58:28 - China.com seems to have a mail loop. If recent experience of the same problem on another list is anything to go by, there's little hope of them fixing the problem. Is there a function in LISTAR to prevent duplicate messages getting through? If not, could Sarasvati in her wisdom please track down the offending subscriber(s) at china.com (or a domain hosted by them) and unsubscribe them... John. -- -- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/ -- Translate your technical documents and web pages- http://www.tradoc.fr/en/
[unicode] Malay (Latin) characters in Unicode?
[Feed another to the shubnet . . .] I have a copy of Shellabear's Practical Malay Grammar that I'm preparing to transcribe for Project Gutenberg. Unfortunately, he represents the Malaysian alphabet in a Latin transliteration that includes ng as a single ligatured form, and I don't know how to transcribe it in Unicode. Some ideas:
(1) Use a private use character. Not feasible, because it needs to be readable by the average person, not just someone who has the patience to set up their computer for this one file.
(2) Use a ZWJ between n and g. If I'm not mistaken, most current systems will show the ZWJ as a little black box, and there are going to be very few systems any time soon that would actually display the ng ligature. Still, a good Unicode system will elide the ZWJ, displaying an acceptable ng with the real information still in the file.
(3) Petition Unicode for a new character. Right. I'm going to argue for a character used in two books (that I know of) that bears an annoying similarity to the ng (non-ligatured) flame wars, and in the best of cases I wait a couple of years for it to be accepted.
(4) Resort to ASCII trickery to distinguish between ng (ligatured) and ng (non-ligatured). Marking the ng (ligatured) would be ugly; marking the unligatured would also be ugly, although a lot rarer - I don't know if Malay (in this transliteration) uses ng (non-ligatured).
(5) Just use ng. A simple, ASCII-only solution. I don't know if it's information-preserving, though.
Any suggestions? -- David Starner - [EMAIL PROTECTED] Gutenberg stuff - http://dvdeug.dhis.org/guten/ (down for the week) Free, encrypted, secure Web-based email at www.hushmail.com
[unicode] Re: removing compromises from unicode (WCode)
[Hoping the shubnet doesn't get this one too . . .] WTF-8 could potentially be as compact or more compact than UTF-8 (for Greek, Arabic ...), since much of the Latin-1 and Latin Extended A blocks aren't needed in WCode. If you moved the other characters down to fill that space, you might win what you lost to C1 compatibility. I've considered writing up my own WCode (just for the heck of it) before. My big fix would be losing ASCII compatibility(!), which allows us to remove redundant and ill-defined controls and characters (ASCII apostrophe! CR-LF!). Move the basic set of controls (LS, PS, ZWJ, etc.) and the basic set of script-neutral punctuation and characters (.,:;?!; possibly the Indo-European (Arabic?) digits 0-9) into the bottom 128, followed by the combining characters and then the decomposed Latin and so on. Losing ASCII compatibility is much more radical than you've proposed, though. -- David Starner - [EMAIL PROTECTED] Pointless (and temporarily down) webpage: http://dvdeug.dhis.org Free, encrypted, secure Web-based email at www.hushmail.com
[unicode] Re: Poll of the day
At 20:56 -0500 2001-03-22, Sarasvati wrote: Here by popular demand is the poll of the day... http://www.unicode.org/~sarasvati/poll.html Not Found The requested URL /~sarasvati/Democratic-Process was not found on this server. Apache/1.3.14 Server at www.unicode.org Port 80
[unicode] Re: Helpful info
Sarasvati wrote: TO FIND OUT WHO IS ON A LIST: Send a message to the listar account on the server with a subject of "who [listname]". You will receive a list of people subscribed to the list who are not hidden. Admins will be able to see everybody, including those hidden. O, thank you! This is a great great utility for spammers! A wise postmaster --clearly not the case of Sarasvati-- would have set all subscribers to "hidden" before enabling this command. Probably, by now, all our addresses have already been harvested and archived by all sellers of virility extenders, cable TV descramblers, home working schemes, hoaxes, etc. So now they won't need to post to the Unicode list to reach us. The next thing to do for all of us is to close our Internet mail accounts and open new ones. The easiest way to do this for people like me, who imprudently subscribed their business or academic addresses, is to resign from their job or university. The only good thing is that employers can now also bypass the (absurd) prohibition on sending recruitment postings -- so we'll have an opportunity to find another job. Oh, by the way, you can bypass any other prohibition too. So, people who need to post GIFs or lengthy quotes now know how to do it: just download the list and paste it in your "To" field. Brilliant move, Sarasvati. _ Marco
[unicode] Re: Moving mail lists
On 23 Mar 2001, at 1:44, Sarasvati wrote on the subject "Re: Moving mail lists": At this moment, there are 691 addresses subscribed to the Unicode mail list. At least 24 of those entities are points of further fan-out to local lists elsewhere. If you can gather a list of at least 346 current subscribers who respond affirmatively to the question "should Sarasvati remove the [unicode] tag in the subject header?", then let it be considered that popular opinion is in your favor, and the tag will be removed. Notwithstanding that (a) I still think this is a stupid, unnecessary and pernicious "innovation" and as such should not be considered because of its inconvenience to those who don't require it and (b) no such act of democracy was required to *institute* this unnecessary change, I'll take the above private response as a request for everyone opposed to or in favour of this piece of nonsense to e-mail Sarasvati [EMAIL PROTECTED] with your opinion. I trust Sarasvati will keep us apprised of the tally, the proportion of non-voters, etc., etc. and is prepared to arrange for the necessary hand-counts and legal actions that will ensue. `~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~ Seán Ó Seaghdha [EMAIL PROTECTED] There is only one thing in the world worse than being talked about, and that is not being talked about. Oscar Wilde.
[unicode] Re: UCS-2 Files
On 2001-03-22 at 14:31 CET, Tomas McGuinness wrote: I am currently developing a product that will support UCS-2 For a new project, it would be better to support UTF-16, rather than UCS-2, from the very beginning. There are already characters accepted for standardization that cannot be encoded in UCS-2. Cf. - http://www.unicode.org/unicode/faq/, - http://www.unicode.org/unicode/faq/utf_bom.html#5, - http://www.unicode.org/unicode/alloc/Pipeline.html#Characters and Scripts Accepted for Unicode. Best wishes, Otto Stolz
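The difference Otto describes can be seen in a few lines. This is only a minimal sketch in Python: the helper name `utf16_units` is made up for illustration and does not come from the thread, and U+10400 is used simply as an example of a code point above U+FFFF. A character in the BMP fits in one 16-bit unit, while anything above U+FFFF needs a surrogate pair, which a strict UCS-2 implementation has no way to represent.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as integers (illustrative helper)."""
    data = text.encode("utf-16-be")  # big endian, no BOM prepended
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

bmp_units = utf16_units("\u4e00")       # first CJK ideograph, inside the BMP
supp_units = utf16_units("\U00010400")  # a code point above U+FFFF

print([hex(u) for u in bmp_units])   # ['0x4e00']
print([hex(u) for u in supp_units])  # ['0xd801', '0xdc00']
```

Code that treats every 16-bit unit as a whole character would see two "characters" in the second case, which is the practical reason to plan for UTF-16 from the start.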
[unicode] Re: Moving mail lists
On Fri, 23 Mar 2001, Sean O Seaghdha wrote: I'll take the above private response as a request for everyone opposed to or in favour of this piece of nonsense to e-mail Sarasvati [EMAIL PROTECTED] with your opinion. I trust Sarasvati will keep us apprised of the tally, the proportion of non-voters, etc., etc. and is prepared to arrange for the necessary hand-counts and legal actions that will ensue. I don't think so. The kind of reply you received simply means that Sarasvati will not change it, without even being sorry for people like us who consider it a pain :( --roozbeh
[unicode] Re: bytes bits
Touché by all of you who've corrected my reliance on dictionaries for tech definitions.
[unicode] Re: UCS-2 Files
Jeff, A byte is the least addressable portion of memory. The IBM 1401, for example, has 6-bit bytes + a word mark. Parity bits don't count. A lot of systems in the 1950s and early 1960s had 6-bit bytes. That is why octal became so popular. Bytes were not used for systems like the IBM 1620, which was a scientific system. Memory was an array of number registers and was not character based. Instead the least addressable memory unit was a word. A byte may be 8 bits now but it was not always 8 bits. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff Guevin Sent: Thursday, March 22, 2001 12:01 PM To: [EMAIL PROTECTED] Subject: [unicode] Re: UCS-2 Files On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote: Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). When is a byte not eight bits? The Web version of the Oxford English Dictionary (http://dictionary.oed.com) says a byte is always eight bits: "A group of eight consecutive bits operated on as a unit in a computer." 1964 BLAAUW BROOKS in IBM Systems Jrnl. III. 122 An 8-bit unit of information is fundamental to most of the formats [of the System/360]. A consecutive group of n such units constitutes a field of length n. Fixed-length fields of length one, two, four, and eight are termed bytes, halfwords, words, and double words respectively. 1964 IBM Jrnl. Res. Developm. VIII. 97/1 When a byte of data appears from an I/O device, the CPU is seized, dumped, used and restored. 1967 P. A. STARK Digital Computer Programming xix. 351 The normal operations in fixed point are done on four bytes at a time. 1968 Dataweek 24 Jan. 1/1 Tape reading and writing is at from 34,160 to 192,000 bytes per second. -- Gaute Strokkenes http://www.srcf.ucam.org/~gs234/ PEGGY FLEMING is stealing BASKET BALLS to feed the babies in VERMONT.
[unicode] Re: UCS-2 Files
Marco, I find that people often understand it better when you get away from bytes, octets etc. and describe Unicode strings as an array of unsigned short (16-bit unsigned integers), in the same manner as single-byte characters are an array of 8-bit integers. This way the only time you have to deal with endian issues is when you deal with the memory or transmission layout of the data. This also helps when you get into null-terminated strings. You cannot terminate a Unicode string with a byte null; it has to be a full 16-bit character. Carl
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Marco Cimarosti Sent: Thursday, March 22, 2001 7:03 AM To: 'Tomas McGuinness'; [EMAIL PROTECTED] Subject: [unicode] Re: UCS-2 Files
Tomas McGuinness wrote: I have a question relating to UCS-2. I am currently developing a product that will support UCS-2 and I have been sent several documents encoded in UCS-2. I have no reader or writer for UCS-2 but I have performed Hexdumps in UNIX. At the beginning of the UCS-2 characters there are two rogue characters 0xFF and 0xFE. Have these characters any importance?
They are quite important, yes. See http://www.unicode.org/unicode/faq/utf_bom.html#24 for details. But beware that they are NOT characters: they are OCTETS (also known as "bytes")! The first thing that I'd suggest you do when starting to work with Unicode and other character sets is to carefully separate the terms "byte" and "character". Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). In brief, those two octets tell you that:
1. It is a Unicode text file.
2. It is in format UCS-2, UTF-16, or UTF-32 (to determine whether it is UTF-32 you need to read the next two octets: if they are 0x00 0x00, then it is UTF-32. Else it is either UCS-2 or UTF-16, which basically you don't need to distinguish).
3. The 16-bit units are little endian, so you have to interpret these two octets as (0xFF + 0xFE * 256), which yields 0xFEFF, the code of the "BOM".
4. All subsequent pairs of octets a,b are interpreted the same way: (a + b * 256).
Regards. _ Marco
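The steps Marco lists can be sketched as follows. This is a hedged illustration in Python, not a complete detector: the function names `sniff_bom` and `le_units` are invented here, the big-endian cases are added as an assumption beyond what the message covers, and a file that really is UTF-16 little endian beginning with U+0000 would be misread as UTF-32 by this simple check.

```python
def sniff_bom(data):
    """Guess the encoding form from a leading BOM; returns (codec_name, bom_length)."""
    if data[:4] == b"\xff\xfe\x00\x00":
        return "utf-32-le", 4
    if data[:4] == b"\x00\x00\xfe\xff":
        return "utf-32-be", 4
    if data[:2] == b"\xff\xfe":
        return "utf-16-le", 2
    if data[:2] == b"\xfe\xff":
        return "utf-16-be", 2
    return None, 0  # no BOM recognised

def le_units(data):
    """Combine each octet pair a, b as a + b * 256 (little endian 16-bit units)."""
    return [data[i] + data[i + 1] * 256 for i in range(0, len(data) - 1, 2)]

sample = b"\xff\xfe" + "Hi".encode("utf-16-le")  # the octets FF FE 48 00 69 00
name, skip = sniff_bom(sample)

print(name)                                # utf-16-le
print([hex(u) for u in le_units(sample)])  # ['0xfeff', '0x48', '0x69']
print(sample[skip:].decode(name))          # Hi
```

Working with the list of 16-bit units rather than the raw octets is also what Carl's suggestion amounts to: once the units are assembled, byte order no longer matters, and a terminator is the unit 0x0000 rather than a single zero octet.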
[unicode] CJK dictionary
Do you know of a CJK (kanji) dictionary with Unicode code points? (The English meaning of this range: 4E00 CJK Ideograph, First; 9FA5 CJK Ideograph, Last; F900 CJK Compatibility Ideograph, First; FA2D CJK Compatibility Ideograph, Last.) Thanks. Géza - Rozsa Geza 432-8279, (30)996-0007; SMSre a subject: [EMAIL PROTECTED]
[unicode] Re: UCS-2 Files
A byte may be 8 bits now but it was not always 8 bits. Au contraire! It was the designers of System/360 who invented the word "byte" to mean the smallest addressable unit of storage, in their case 8 bits. It is others who have appropriated the word for their own purposes, as has happened with so many words since language was invented. Remember Humpty Dumpty! Mike.
[unicode] Re: Poll of the day
At 09:21 -0800 2001-03-23, Michael \(michka\) Kaplan wrote: After all, no one at all is claiming incompetence on the part of our ever-vigilant Bubble Queen of the River Ganga, but some people are talking about how much they preferred the way our effervescent but bitwise conservative used to do things. If it ain't broke, don't fix it. I fail to see any utility in this newfangled "[unicode]" appendage. -- Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
[unicode] Re: Helpful info
Sarasvati had written: TO FIND OUT WHO IS ON A LIST... Marco had written: O, thank you! This is a great great utility for spammers! [EMAIL PROTECTED] wrote: Marco: you evidently missed this line in her message: The list of subscribers is not available. Because of this very sentence, I have tested the Who command, and guess what? On Fri, 23 Mar 2001 08:45:18 -0500 (EST), Listar [EMAIL PROTECTED] happily sent me a list of 686 subscribers to the Unicode list. Now I am a subscriber myself, and the Who command may well be restricted to subscribers. But even this restriction would not be safe, as the recently detected spamming technique shows. So, it would be a good idea to disable the Who command for the Unicode list. Best wishes, Otto Stolz
[unicode] Re: Helpful info
Marco, Thank you for your valuable and candid opinion. I appreciate your confidence in my intelligence. Let me point out, however, that I specifically wrote in my NOTES about the helpful info: The list of subscribers is not available. as Peter Constable has already pointed out. Cheery regards from your effervescent, -- Sarasvati
[unicode] Re: Poll of the day
Unfortunately, anyone who felt strongly about things could easily (and reasonably) take it another way. After all, no one at all is claiming incompetence on the part of our ever-vigilant Bubble Queen of the River Ganga, but some people are talking about how much they preferred the way our effervescent but bitwise conservative used to do things. Unfortunately, our cheery (and occasionally quite cheeky, as that "poll" clearly proved!) but effervescent Sarasvati has decided that the way things used to be is in some way not preferred, and that the list server of our forward-looking group must cater to the needs of older systems (which apparently were never working adequately up till now?). At least they still allow people to vote at UTC meetings. :-) michka - Original Message - From: "Carl W. Brown" [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, March 23, 2001 8:13 AM Subject: [unicode] Re: Poll of the day Adam, I think that the poll was not arrogant but a little fun to break the tension. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of G. Adam Stanislav Sent: Friday, March 23, 2001 4:38 AM To: Michael Everson; [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: [unicode] Re: Poll of the day At 09:48 23-03-2001 +, Michael Everson wrote: http://www.unicode.org/~sarasvati/poll.html Not Found The requested URL /~sarasvati/Democratic-Process was not found on this server. That was the whole point, I believe. Since there were only two choices, and both identical, there is no need to actually process the form. The poll offered what we used to call during the Communist era the Paradise Choice: God brought Eve to Adam and said: "Choose one." That Sarasvati wants to do things her way is fine by me. That she made me log into the Internet, launch my browser, just to get to the mockery of the poll.html and no Democratic-Process, is a slap in the face. The Consortium is as arrogant as ever.
Adam --- Whiz Kid Technomagic - brand name computers for less. See http://www.whizkidtech.net/pcwarehouse/ for details.
[unicode] Re: Helpful info
Because of this very sentence, I have tested the Who command, and guess what? On Fri, 23 Mar 2001 08:45:18 -0500 (EST), Listar [EMAIL PROTECTED] happily sent me a list of 686 subscribers to the Unicode list. Paranoid, I just tried the same and got: SNIP List context changed to 'unicode' by following command. who unicode List membership is only viewable by list admins. Valid command was found in subject field, body won't be checked for further commands. --- Listar v1.0.0 - job execution complete. /SNIP Otto, could you try again, please? /|/|ike
[unicode] Re: What is Unicode?
Another web page, for your collective amusement: http://linguistics.berkeley.edu/~rscook/html/Unicode-tetralog.html
[unicode] Re: Moving mail lists
On 21 Mar 2001, at 11:58, [EMAIL PROTECTED] wrote on the subject "[unicode] Re: Moving mail lists": Those who can filter their mail can also alter the subject line easily with, for example, a small Perl script. Since this is so easy, could you send me one? `~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~ Seán Ó Seaghdha [EMAIL PROTECTED] The only way to get rid of a temptation is to yield to it. Oscar Wilde.
[unicode] Re: Poll of the day
At 08:13 23-03-2001 -0800, Carl W. Brown wrote: Adam, I think that the poll was not arrogant but a little fun to break the tension. I'd have probably been more amused had I not been logged off the Internet at the time I was reading the message announcing the poll: I logged on, loaded the browser, etc, just to participate in the poll. Adam
[unicode] Re: Moving mail lists
Those who can filter their mail can also alter the subject line easily with, for example, a small Perl script.
Sean: Since this is so easy, could you send me one?
% perl -ne 's/\[unicode\]// if (/^Subject:/);' messagefile
- Mark Leisher, Computing Research Lab, New Mexico State University, Box 30001, Dept. 3CRL, Las Cruces, NM 88003
"Times are bad. Children no longer obey their parents, and everyone is writing a book." -- Marcus Tullius Cicero
[unicode] Re: Moving mail lists
Oops. I missed a spot.
% perl -ne 's/\[unicode\]// if (/^Subject:/); print;' messagefile
- Mark Leisher, Computing Research Lab, New Mexico State University, Box 30001, Dept. 3CRL, Las Cruces, NM 88003
"Times are bad. Children no longer obey their parents, and everyone is writing a book." -- Marcus Tullius Cicero
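For readers without Perl, roughly the same filter can be sketched in Python. This is not Mark's script: the function name `strip_list_tag` is hypothetical, and unlike the one-liner it only touches Subject: lines in the header block (before the first blank line) and also swallows one space after the tag. Folded (multi-line) Subject headers are not handled.

```python
import re

def strip_list_tag(lines, tag="[unicode]"):
    """Yield the message lines with tag removed from Subject: header lines only."""
    pattern = re.compile(re.escape(tag) + r"\s?")
    in_headers = True
    for line in lines:
        if in_headers and line.strip() == "":
            in_headers = False  # the first blank line ends the header block
        if in_headers and line.startswith("Subject:"):
            line = pattern.sub("", line, count=1)
        yield line

message = [
    "Subject: [unicode] Re: Moving mail lists\n",
    "\n",
    "Subject: [unicode] this line is in the body and is left alone\n",
]
print("".join(strip_list_tag(message)), end="")
```

To use it as a mail filter, one could pass `sys.stdin` as `lines` and write the result to `sys.stdout`, much as the one-liner reads `messagefile` and prints every line.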
[unicode] Re: removing compromises from unicode (WCode)
From: "Jonathan Coxhead" [EMAIL PROTECTED] It would be very entertaining to do the same job with the ideographs (down to the radical level) and count the number of atoms. I suspect the resulting "character set" would contain less than 2000 atoms altogether. More than just entertaining, one would definitely find the space saved to be about 1000 times the work of the other decompositions. Addressing the non-CJK and ignoring the CJK is like fixing app performance at boot and ignoring the entire rest of the app's lifetime! MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
[unicode] Re: removing compromises from unicode (WCode)
It would be very entertaining to do the same job with the ideographs (down to the radical level) and count the number of atoms. I suspect the resulting "character set" would contain less than 2000 atoms altogether. MichKa replied ... More than just entertaining, one would definitely find the space saved to be about 1000 times the work of the other decompositions. Addressing the non-CJK and ignoring the CJK is like fixing app performance at boot and ignoring the entire rest of the app's lifetime! "Space saved"? Good heavens, I never had anything as mundane as "disc space" (or "performance" for that matter) in mind. It was purely an exercise in concepts and relationships, with no other goal in mind. Fun, for some value of "fun" :-)
[unicode] Re: Malay (Latin) characters in Unicode?
At Fri, 23 Mar 2001 00:13:33 -0800, Rick McGowan [EMAIL PROTECTED] wrote: David Starner wrote: I have a copy of Shellabear's Practical Malay Grammar that I'm preparing to transcribe for Project Gutenberg. Unfortunately, he represents the Malaysian alphabet in a Latin transliteration that includes ng as a single ligatured form, and I don't know how to transcribe it in Unicode. Could you perhaps post or point to a picture of what it looks like? I suppose it's an "N" with a loopy tail of some type. More like rg. A picture is attached. The character you are looking for is probably U+014B in lowercase or U+014A in uppercase. I would be rather surprised if that's not what you're looking for. It's not exactly what I was looking for. I may just use it and make a note that the glyph is probably not exactly right. BTW, a bit off topic here but: I think it's high time that Project Gutenberg adopted some very clear character encoding guidelines now that they're expanding so widely. Or have they already adopted them and I've just missed the policy statement...? They're in for a real mess if they don't specify character encodings in a very controlled way. At some points, they are already a real mess. You can dig through Gutenberg archives and find various (unlabeled) encodings for the Latin-1 coverage. There's at least one Japanese document that just says "you need a Japanese OS to read this." 8-bit documents are usually labeled as 8-bit, without any indication of encoding. OTOH, the policy of doing everything possible in ASCII has saved Gutenberg some problems. They're moving towards Unicode for any files that need it. The Bulgarian files are clearly labeled windows-1251, which is at least a start. See ftp://metalab.unc.edu/pub/docs/books/gutenberg/GUTINDEX.02 and GUTINDEX.01 for recent examples. Most of the unmarked stuff is ASCII, but there's a number of clearly Unicode marked and "8-bit German" marked files.
-- David Starner - [EMAIL PROTECTED] Free, encrypted, secure Web-based email at www.hushmail.com R_T_malay_ng.png
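To make the candidate transcriptions discussed in this thread concrete, here is a small sketch in Python that spells out what each option actually stores in the file. The dictionary labels are illustrative only; the code points are U+014B (the eng suggested above), U+200D (ZWJ), and plain ASCII n and g.

```python
import unicodedata

# Three of the options from the thread, as they would appear in the text file.
candidates = {
    "eng (U+014B)": "\u014b",
    "n + ZWJ + g": "n\u200dg",
    "plain ng": "ng",
}

for label, text in candidates.items():
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)

# The uppercase form mentioned in the reply is the case pair of U+014B.
print("\u014b".upper() == "\u014a")  # True
```

The eng option is a single character with its own case pair, while the ZWJ option keeps n and g as separate characters plus a joiner that a renderer may honour or ignore, so plain-text searches for "ng" would behave differently under each choice.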
[unicode] Reading mojibake
I taught myself to read a bit of SJIS mojibake, partly from studying the scrambled output of my clock program with fullwidth digits. (glitch plus O = 0. Glitch plus P = 1, etc., I think) Can anyone else here read mojibake? What is the English word for mojibake? Isn't Unicode mojibake three mojibake per character, rather than just two like in SJIS? How do you fix this, anyway? Like if you have a lot of Unicode text, so you don't need the extra byte. Maybe 2 1/2 bytes per character would be good. I mean for the extra planes and all. You guys could make a script for this in 10 minutes. *** JUUICHIKETAJIN *** ___ Get your own FREE Bolt Onebox - FREE voicemail, email, and fax, all in one place - sign up at http://www.bolt.com
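The byte counts asked about here can be checked directly. The following is a minimal sketch in Python, assuming its standard `shift_jis`, `utf-8`, `utf-16-le` and `cp1252` codecs; the choice of Windows-1252 as the "wrong" decoder is only an example of how such mojibake can arise, not something stated in the message.

```python
text = "\uff10\uff11"  # fullwidth digits zero and one

sjis = text.encode("shift_jis")   # two bytes per character
utf8 = text.encode("utf-8")       # three bytes per character for these
utf16 = text.encode("utf-16-le")  # two bytes per character for these

print(len(sjis), len(utf8), len(utf16))  # 4 6 4

# Reading the Shift_JIS bytes as Windows-1252 gives "glitch plus O, glitch plus P".
garbled = sjis.decode("cp1252")
print(garbled)
```

So the three-bytes-per-character observation holds for UTF-8, while UTF-16 already stores these characters in two bytes each; characters in the extra planes take four bytes in either form.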