Re: Is there Unicode mail out there?
In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes:

One exception to this should be US-ASCII, because not only is the repertoire of US-ASCII a subset of the repertoire of UTF-8, but the representation of every US-ASCII character is identical in UTF-8. A smart mail client would notice that all characters are in the US-ASCII repertoire and label outgoing messages as US-ASCII EVEN if it's configured to label outgoing messages as UTF-8 [...]

I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message: if it contains only US-ASCII characters, it should be tagged as US-ASCII; otherwise, if it contains only ISO 8859-1 characters, it should be tagged as ISO 8859-1. Only if it actually contains CP1252-specific characters, like smart quotes or long dashes, should it be tagged as CP1252. As Jungshik observed, the same goes for UTF-8.

-Doug Ewell
Fullerton, California
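The downgrade logic Doug describes is easy to sketch. The following is a minimal illustration (the function name and the candidate list are my own choices, not taken from any RFC): try each candidate charset in order of increasing repertoire, and return the first one that can represent every character in the message.

```python
def minimal_charset(text: str) -> str:
    """Pick the smallest charset that can represent the message.

    A sketch of the downgrade logic described above; the candidate
    list and its ordering are illustrative assumptions.
    """
    for charset in ("us-ascii", "iso-8859-1", "cp1252"):
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            pass  # repertoire too small; try the next superset
    return "utf-8"  # fall back to a universal encoding

print(minimal_charset("Hello"))                        # us-ascii
print(minimal_charset("na\u00efve"))                   # iso-8859-1
print(minimal_charset("\u201csmart quotes\u201d"))     # cp1252
```

A real mailer would also have to consider headers and attachments separately, but the body-scanning step is exactly this cheap.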
Re: Is there Unicode mail out there?
Please disregard my previous message about a work-around for the Outlook Express problem. Although it works, non-UTF-8 messages are no longer properly displayed, an unacceptable trade-off.

Another possibility tested was to add to the signature an innocuous character which isn't included in any code page. I tried the zero-width space. When the zero-width space was copied into the signature of a reply to a message encoded as Thai (Windows), Outlook Express prompted to "Send as Unicode..." when the letter was queued to be sent later. So far, so good. I figured it would be possible to set up a signature with the ZWS to eliminate the need to manually change the encoding of outgoing messages to UTF-8 every time. Unfortunately, on Windows ME the signature information is stored in the Registry, and it's ASCII, so the ZWS got converted to a question mark and doesn't get switched back when it's added to a message.

So, I tried setting up a signature file, including the ZWS, to be added to each outgoing message. In this case, MSOE displays the UTF-8 ZWS as mojibake (gibberish) when the signature is added to the outgoing message. Perhaps a future version of Outlook will correct the problem.

Best regards,

James Kass.
Re: Is there Unicode mail out there?
[EMAIL PROTECTED] wrote:

In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII, because not only is the repertoire of US-ASCII a subset of the repertoire of UTF-8, but the representation of every US-ASCII character is identical in UTF-8. A smart mail client would notice that all characters are in the US-ASCII repertoire and label outgoing messages as US-ASCII EVEN if it's configured to label outgoing messages as UTF-8 [...] I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message, and if it contains only US-ASCII characters, the message should be tagged as US-ASCII.

The RFCs/BCPs do encourage using as minimal a charset as possible. Anyway, UTF-8 email is nowhere right now. Kat Momoi of Netscape has suggested that about the only way this could change is if email client vendors turn it on by default in new product releases. I won't be the first!

Having done a lot of email client programming using the RFCs as a basis, let me say that in general RFCs are vague, and not always the best practice for interoperability when it comes to email. For example, CRLF in message bodies is recommended, but actually reduces interoperability, particularly with some sub-versions of IE 5, so I don't know of any email client that does it. And quoted-printable is way too complicated to expect conforming implementations. And don't get me started about all the random charsets that RFCs promote that nobody adopts!

James.
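For a taste of what quoted-printable does, here is Python's stdlib quopri module applied to a short UTF-8 string. This is only the byte-escaping layer; much of the complexity James alludes to lives in the parts not shown here (soft line breaks, the 76-character line limit, header encoding):

```python
import quopri

# Quoted-printable escapes each non-ASCII byte as =XX, so each of
# the two UTF-8 bytes of "é" (0xC3 0xA9) becomes three ASCII chars.
raw = "r\u00e9sum\u00e9".encode("utf-8")
encoded = quopri.encodestring(raw)
print(encoded)
print(quopri.decodestring(encoded) == raw)  # round-trips losslessly
```

The round trip is lossless, but a conforming *encoder* must also handle trailing whitespace and line wrapping, which is where implementations tend to diverge.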
Re: Is there Unicode mail out there?
Here's a work-around that seems to work. Added the ZWS after the signature in a signature file. Because the mojibake for ZWS includes the Euro currency symbol, OE prompts to 'send as Unicode' when replying to a non-UTF-8 sender. Of course, the time saved by not having to manually change the encoding will probably be less than the time lost explaining what the junk is under my name. Best regards, James Kass. ​
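The reason the ZWS trick works can be checked mechanically: U+200B ZERO WIDTH SPACE is absent from the common legacy code pages, so any mailer that tests representability is forced to fall back to Unicode. A quick check, using Python codec names as stand-ins for the mailer's charsets:

```python
# U+200B ZERO WIDTH SPACE exists in Unicode but in none of the
# common legacy code pages, so its presence forces a Unicode encoding.
zws = "\u200b"
results = {}
for charset in ("cp1252", "cp874", "iso-8859-1"):
    try:
        zws.encode(charset)
        results[charset] = True
    except UnicodeEncodeError:
        results[charset] = False
    print(charset, "can" if results[charset] else "cannot", "represent ZWS")
```

All three report "cannot", which is exactly why OE has no choice but to offer "send as Unicode".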
Re:
大家好!!! ← ふりがなください Is that like だいすき? No, だいすき is 大好き. Something like da jia hao in Mandarin but with appropriate Chinese tones. こんにちは in Japanese, I gather. Dunno about Mandarin, but it's Daaih Gā Hóu in Cantonese. Bit formal though ...How about 你哋點阿? ... just kidding : ) Michael
Re: Is there Unicode mail out there?
On Thu, 12 Jul 2001, James Kass wrote:

Here's a work-around that seems to work. Added the ZWS after the signature in a signature file. Because the mojibake for ZWS includes the Euro currency symbol, OE prompts to 'send as Unicode' when replying to a non-UTF-8 sender.

What is mysterious is why this prompting (by MS OE) did not happen to Mike Ayers when he replied to Peter's message with a Thai string in Windows-874, adding some Chinese characters, while the MS OE (5.50.x) I tried certainly prompted me to pick one of three options (1. send as Unicode, 2. send as is, in Windows-874, risking loss of info, 3. cancel) when I did the same thing. ZWS and Chinese characters have no reason to be treated differently when added to a Windows-874 encoded message.

BTW, Mozilla/Netscape 6 also uses by default the encoding of the message you're replying to (or its closest match among IANA-registered MIME charsets; thus, in place of Windows-874, Mozilla/Netscape 6 uses TIS-620). When one adds characters outside the repertoire of that encoding, it warns that some characters are not representable in the current encoding and that it's necessary to change the encoding to something that can represent all characters (it does not suggest Unicode). It offers two options: go ahead despite potential loss of some characters, or cancel and change the encoding.

Perhaps both Mozilla/Netscape 6 and MS OE should have an option ('toggle-switchable') to let users specify that their preferred encoding (set in preferences) be used by default regardless of the encoding of messages they're replying to.

Jungshik Shin
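The check Mozilla/Netscape 6 performs before warning can be sketched as follows (a hypothetical helper, not Mozilla's actual code): scan the reply for characters the current charset cannot encode, and warn if any are found.

```python
def unrepresentable(text, charset):
    """Return the set of characters in `text` that `charset`
    cannot encode -- the characters a mailer would warn about."""
    bad = set()
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.add(ch)
    return bad

# Thai text survives TIS-620; an added CJK character does not.
print(unrepresentable("\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35", "tis-620"))
print(unrepresentable("\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35 \u4e2d", "tis-620"))
```

The first call returns an empty set (no warning needed); the second returns the lone CJK character, which is the moment a mailer should either prompt or silently upgrade to UTF-8.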
Re: Is there Unicode mail out there?
Hmm, it didn't work either. OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: ふりがなください
'大家好!!!' is the most common Chinese greeting! '大家' means 'EVERYBODY', '好' means 'GOOD'; 大家好!!! means 'HOPE EVERYBODY'S DOING GOOD'. It is something like '皆さん、ご機嫌いかがでしょうか' ('Everyone, how are you?') in Japanese.
Re: A UTF-8 based News Service
In a message dated 2001-07-12 8:27:20 Pacific Daylight Time, [EMAIL PROTECTED] writes:

As someone involved in the service I often wish there was some form of compressed Unicode encoding. The 3-byte penalty that Ethiopic bears under UTF-8 turns into higher bandwidth that web hosting services meter and charge for by the megabyte. For a popular site this soon makes UTF-8 a costly option to support. A system analogous to iso-8859-x whereby Ethiopic and other scripts in the 3-byte range could be shifted back into the 2-byte range might help (generally only English and Ethiopic are desired together).

Today is your lucky day. Check out Unicode Technical Standard #6, A Standard Compression Scheme for Unicode: http://www.unicode.org/unicode/reports/tr6/

SCSU uses 128-character windows to compress small alphabetic scripts to almost 1 byte per character. Since Ethiopic occupies three 128-character half-blocks, SCSU must use three windows and switch between them, but the overhead is still much lower than UTF-8's. In the worst case (each character belongs to a different half-block than the one before), you will still use only 2 bytes per character.

SCSU is fully supported by SC UniPad, a Unicode text editor that is currently available for free. For more information, visit: http://www.unipad.org/

-Doug Ewell
Fullerton, California
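To put rough numbers on this: under UTF-8 every Ethiopic character costs 3 bytes, while under SCSU a character in the active window costs 1 byte plus occasional window-switch overhead. A back-of-the-envelope comparison (the sample phrase and the flat 2-byte SCSU overhead figure are illustrative assumptions, not the output of a real SCSU encoder):

```python
# Ethiopic spans U+1200-U+137F, i.e. three 128-character half-blocks,
# so each character costs 3 bytes in UTF-8.
text = "\u1230\u120b\u121d \u1208\u12d3\u1208\u121d"  # Amharic "hello world" (assumed wording)
utf8_size = len(text.encode("utf-8"))

# Crude SCSU estimate: 1 byte per character plus a couple of bytes of
# window-setup/switch overhead -- a rough lower bound, not real output.
scsu_estimate = len(text) + 2

print(utf8_size, scsu_estimate)  # roughly a 2x saving on this sample
```

Seven Ethiopic characters plus one space come to 22 bytes in UTF-8, against roughly 10 under SCSU, which matches Doug's "almost 1 byte per character" claim for text that stays within one or two windows.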
Re: Is there Unicode mail out there?
(I didn't read the whole thread, so maybe I missed a step.) So the proposal is that minimizing the charset is a good thing?

This means that you and I start out in a conversation about a product I am trying to sell you; it happens to be all in ASCII and we exchange several mails successfully. Then I quote you a price in euros, and my 1252 message gets corrupted by your reader, which can handle only 8859-1 or ASCII, and you miss the fact that the euro sign is corrupted and think we are talking dollars or some other currency.

Although I understand why you would want a minimal charset in order to not needlessly prevent communication, the implication of reliability and trust that is built by having some success is a problem. You think you are communicating successfully, but when it is critical you may not be... Perhaps if a harder line were taken when characters are used that cannot be converted, this would make more sense (i.e., give a very clear, recognizable indication of corruption or conversion failures).

tex

[EMAIL PROTECTED] wrote: In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII, because not only is the repertoire of US-ASCII a subset of the repertoire of UTF-8, but the representation of every US-ASCII character is identical in UTF-8. A smart mail client would notice that all characters are in the US-ASCII repertoire and label outgoing messages as US-ASCII EVEN if it's configured to label outgoing messages as UTF-8 [...] I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message: if it contains only US-ASCII characters, it should be tagged as US-ASCII; otherwise, if it contains only ISO 8859-1 characters, it should be tagged as ISO 8859-1. Only if it actually contains CP1252-specific characters, like smart quotes or long dashes, should it be tagged as CP1252. As Jungshik observed, the same goes for UTF-8. -Doug Ewell Fullerton, California

--
---
Tex Texin, Director, International Business
mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax: +1-781-280-4655
the Progress Company, 14 Oak Park, Bedford, MA 01730
---
A UTF-8 based News Service
Greetings, I thought this would be of interest to people here who might be involved in multilingual news services.

The Ethiopian News Headlines has relocated to a new server at http://www.ethiozena.net/ and is making it easier than ever to read news headlines in Unicode. A companion Unicode-only server has been launched at http://unicode.ethiozena.net/ which serves articles in UTF-8 encoding only. Other new features include localization in three languages, and daily article links are packaged in XML for other news services to link to (see http://www.ethiozena.net/zena.xml and a demonstration parsing script in Perl at http://www.ethiozena.net/zena.pl.txt).

As someone involved in the service, I often wish there were some form of compressed Unicode encoding. The 3-byte penalty that Ethiopic bears under UTF-8 turns into higher bandwidth that web hosting services meter and charge for by the megabyte. For a popular site this soon makes UTF-8 a costly option to support. A system analogous to iso-8859-x, whereby Ethiopic and other scripts in the 3-byte range could be shifted back into the 2-byte range, might help (generally only English and Ethiopic are desired together). Fortunately there is mod_gzip for Apache. I would appreciate any information about other options.

thanks,
/Daniel
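On the mod_gzip point: general-purpose compression already recovers most of the 3-byte penalty for typical article text, because the lead bytes of UTF-8-encoded Ethiopic are highly repetitive. A quick demonstration with Python's stdlib gzip (the sample text is artificially repetitive, so the exact ratio means little; real articles compress less dramatically):

```python
import gzip

# A repetitive Ethiopic sample: 200 copies of a 4-character phrase,
# 10 UTF-8 bytes each, 2000 bytes in total.
article = ("\u1230\u120b\u121d " * 200).encode("utf-8")
compressed = gzip.compress(article)

# gzip largely cancels the 3-byte-per-character penalty on the wire,
# which is why serving UTF-8 through mod_gzip is so effective.
print(len(article), len(compressed))
```

The transfer cost then depends on the entropy of the text rather than on the width of its UTF-8 encoding, which is the property Daniel is after.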
More about SCSU (was: Re: A UTF-8 based News Service)
I should have also mentioned that SCSU is fully supported by the programming toolkit ICU (International Components for Unicode), found at: http://oss.software.ibm.com/icu/ An Open Source project, ICU is available for free and comes with voluminous documentation. SCSU is also registered as an IANA charset, although you are unlikely to find raw SCSU text on the Internet, due to its use of control characters (bytes below 0x20). Hope this helps. -Doug Ewell Fullerton, California
Re: Is there Unicode mail out there?
My other e-mail was a real "moji-baka", I'd say. That would be a good term, 文字馬鹿:

Re: Is there Unicode mail out there?

(I didn't read the whole thread, so maybe I missed a step.) So the proposal is that minimizing the charset is a good thing? This means that you and I start out in a conversation about a product I am trying to sell you; it happens to be all in ASCII and we exchange several mails successfully. Then I quote you a price in euros, and my 1252 message gets corrupted by your reader, which can handle only 8859-1 or ASCII, and you miss the fact that the euro sign is corrupted and think we are talking dollars or some other currency. Although I understand why you would want a minimal charset in order to not needlessly prevent communication, the implication of reliability and trust that is built by having some success is a problem. You think you are communicating successfully, but when it is critical you may not be... Perhaps if a harder line were taken when characters are used that cannot be converted, this would make more sense (i.e., give a very clear, recognizable indication of corruption or conversion failures). tex

[EMAIL PROTECTED] wrote: In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII, because not only is the repertoire of US-ASCII a subset of the repertoire of UTF-8, but the representation of every US-ASCII character is identical in UTF-8. A smart mail client would notice that all characters are in the US-ASCII repertoire and label outgoing messages as US-ASCII EVEN if it's configured to label outgoing messages as UTF-8 [...] I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message: if it contains only US-ASCII characters, it should be tagged as US-ASCII; otherwise, if it contains only ISO 8859-1 characters, it should be tagged as ISO 8859-1. Only if it actually contains CP1252-specific characters, like smart quotes or long dashes, should it be tagged as CP1252. As Jungshik observed, the same goes for UTF-8. -Doug Ewell Fullerton, California

--
---
Tex Texin, Director, International Business
mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax: +1-781-280-4655
the Progress Company, 14 Oak Park, Bedford, MA 01730
---
Re: Is there Unicode mail out there?
On Thu, 12 Jul 2001 [EMAIL PROTECTED] wrote: Hmm, it didn't work either. OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว

Finally, you succeeded! Congratulations :-). Could you explain what you did differently this time so that other Lotus Notes users can benefit from your experience/experiment?

Jungshik Shin
RE: Is there Unicode mail out there?
From: Jungshik Shin [mailto:[EMAIL PROTECTED]]

What is mysterious is why this prompting (by MS OE) did not happen to Mike Ayers when he replied to Peter's message with a Thai string in Windows-874, adding some Chinese characters, while the MS OE (5.50.x) I tried certainly prompted me to pick one of three options (1. send as Unicode, 2. send as is, in Windows-874, risking loss of info, 3. cancel) when I did the same thing. ZWS and Chinese characters have no reason to be treated differently when added to a Windows-874 encoded message.

Not mysterious, really: I'm using Outlook, not Outlook Express. Despite the similarity of names, the differences seem to be considerable. It is disturbing, though, that the premium product has less desirable behavior than the free one in this case.

/|/|ike
Re: Is there Unicode mail out there?
Jungshik Shin wrote: Perhaps both Mozilla/Netscape 6 and MS OE should have an option ('toggle-switchable') to let users specify that their preferred encoding (set in preferences) be used by default regardless of the encoding of messages they're replying to.

It would be nice... MS OE appeared to already have the option. Under Tools-Options-Send, there's a check-box for "Reply to messages using the format in which they were sent". Under Tools-Options-Send-International Settings, there's a provision for the user to choose a default encoding and a check-box to "Use the following default encoding for outgoing messages:". Even though this system was set up accordingly, outgoing messages which were replies to messages in non-UTF-8 encodings weren't being sent in UTF-8, to my surprise, chagrin, and dismay.

Best regards,

James Kass.
​
RE: Is there Unicode mail out there?
In any case, no matter whether new message, reply, or forward, you can force OE to use a specific encoding via the Format.Encoding menu. There is no option to ALWAYS use a specific encoding in replies and forwards; you will have to choose manually each time. OE itself has no option to automatically determine the best outbound encoding (and I agree that generally the encoding with the smallest repertoire is the best). If the chosen encoding does not hold the characters used, OE will suggest only UTF-8, never any other charset.

Note: an HTML message to an HTML4-capable recipient will transport any character regardless of the chosen encoding. That might explain the different results you are seeing when sending to differently enabled recipients.

Replying in the charset of the original message is in my view reasonable behavior: the recipient of your reply has the best chance to read the message in the encoding the original message was sent in. Changing the encoding decreases the chance the replyee will be able to read your message.

-Original Message-
From: James Kass [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 12, 2001 1:18 PM
To: Jungshik Shin
Cc: Unicode List
Subject: Re: Is there Unicode mail out there?

Jungshik Shin wrote: Perhaps both Mozilla/Netscape 6 and MS OE should have an option ('toggle-switchable') to let users specify that their preferred encoding (set in preferences) be used by default regardless of the encoding of messages they're replying to.

It would be nice... MS OE appeared to already have the option. Under Tools-Options-Send, there's a check-box for "Reply to messages using the format in which they were sent". Under Tools-Options-Send-International Settings, there's a provision for the user to choose a default encoding and a check-box to "Use the following default encoding for outgoing messages:". Even though this system was set up accordingly, outgoing messages which were replies to messages in non-UTF-8 encodings weren't being sent in UTF-8, to my surprise, chagrin, and dismay.

Best regards,

James Kass.
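Chris's note about HTML mail deserves a concrete illustration: HTML 4 allows any Unicode character to be written as a numeric character reference (&#NNNN;), so the declared charset only needs to carry ASCII. A sketch using Python's xmlcharrefreplace error handler (the helper name is my own):

```python
def to_ncr(text, charset):
    """Encode `text` for `charset`, escaping unrepresentable
    characters as HTML numeric character references (&#NNNN;)."""
    return text.encode(charset, errors="xmlcharrefreplace").decode(charset)

# Thai characters survive an iso-8859-1 HTML body as references.
print(to_ncr("Thai: \u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35", "iso-8859-1"))
# Thai: &#3626;&#3623;&#3633;&#3626;&#3604;&#3637;
```

The cost is bloat (7+ bytes per escaped character) and loss of readability in the raw source, but any HTML4-capable reader reconstructs the original characters, exactly as Chris describes.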
RE: Is there Unicode mail out there?
From: Chris Wendt [mailto:[EMAIL PROTECTED]]

Replying in the charset of the original message is in my view reasonable behavior: the recipient of your reply has the best chance to read the message in the encoding the original message was sent in. Changing the encoding decreases the chance the replyee will be able to read your message.

For person-to-person emails, this makes sense. It does not hold up for mailing lists, however. It's not necessarily unreasonable behavior, but for a mailing list the odds of readability are tied to the character set itself, regardless of the character set used in any individual mailing (note that the Windows Thai character set could not be viewed by many people; changed to UTF-8, almost everyone could read it). For this reason, I would really like to see option-controlled behavior (with the current behavior as the default).

/|/|ike
Re: Is there Unicode mail out there?
Chris Wendt wrote:

Replying in the charset of the original message is in my view reasonable behavior: the recipient of your reply has the best chance to read the message in the encoding the original message was sent in. Changing the encoding decreases the chance the replyee will be able to read your message.

When a user issues an instruction to a computer, it is a command rather than a request. If a user selects the option to "Use the following default encoding for outgoing messages:", then the expected behavior is compliance. Of course, you are quite right in that the recipient is more likely to be able to read a message sent in the recipient's default. As we move towards a world encoding standard, perhaps more applications will use the standard as default.

This message is being sent in Arabic (Windows) because it is in response to a message sent in that encoding. The author of the original message has noted my work-around and has cleverly prevented it by selecting a code page which includes the special character I'm using for the kludge.

Best regards,

James Kass.
Re: Wordprocessors in Korean
At 06:50 AM 2001-07-10, Genenz wrote:

...Now a teacher from Korea told me MS Word has some shortcomings concerning Korean, and that there exists another word processor much more frequently used than Word 2000 in his country. (He also complained about Win2000, saying there are better choices for multilanguage apps and the net, but that is another story not to be discussed here.)

Word 2000 on Windows 2000 supports Korean well enough for my needs, but I am not an authority on what Koreans want in a word processor. Can you get an explanation for us of its shortcomings? As for better multilingual apps, indeed there are some that support more languages, but I have not found any fully multilingual products for Mac or PC capable enough and reliable enough for my needs. Not that Word is totally reliable, of course, but I manage.

Edward Cherlin
Generalist
"A knot! Oh, do let me help to undo it." Alice in Wonderland
Re: A UTF-8 based News Service
As someone involved in the service I often wish there was some form of compressed Unicode encoding. The 3-byte penalty that Ethiopic bears under UTF-8 turns into higher bandwidth that web hosting services meter and charge for by the megabyte. For a popular site this soon makes UTF-8 a costly option to support. A system analogous to iso-8859-x whereby Ethiopic and other scripts in the 3-byte range could be shifted back into the 2-byte range might help (generally only English and Ethiopic are desired together). Fortunately there is mod_gzip for Apache. I would appreciate any information about other options.

What about UTF-16? Encode all characters as 2 bytes and your problem is solved; UTF-16 should be supported by all recent Unicode-supporting web browsers.

-- David Starner - [EMAIL PROTECTED]
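The trade-off David proposes is easy to quantify: UTF-16 drops pure Ethiopic from 3 bytes per character to 2, but doubles the cost of any ASCII portions (markup, English text). A quick comparison, using illustrative sample strings of my own choosing:

```python
# UTF-16 stores BMP characters in 2 bytes each; UTF-8 uses 3 bytes
# for Ethiopic (U+1200-U+137F) but only 1 byte for ASCII.
ethiopic = "\u1230\u120b\u121d" * 100   # 300 Ethiopic characters
ascii_text = "hello" * 100              # 500 ASCII characters

for label, s in (("ethiopic", ethiopic), ("ascii", ascii_text)):
    print(label,
          len(s.encode("utf-8")),
          len(s.encode("utf-16-le")))   # LE without BOM, for a clean count
```

So for a page that is mostly HTML markup around Ethiopic body text, the savings shrink or reverse, which is why gzip (content-independent) is often the better lever than switching encoding forms.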