Re: Is there Unicode mail out there?
Mark Davis wrote: The quotation I have is from my college Greek textbook (sadly my fluency has reduced to essentially zero after all these years). Perhaps some Greeks on the list could say which is the more accurate formulation? Mark — πάντων μέτρον ἄνθρωπος — Πρωταγόρας [http://www.macchiato.com] In Dictionary of Foreign Phrases and Abbreviations (Guinagh, 1965), the following appears: Panton metron anthropos estin. Gk--Man is the measure of all things. Quoted by Plato, Theaetetus, 178b. Best regards, James Kass.
Re: Is there Unicode mail out there?
Sorry - By 'pattern restrictions on mixed content' I meant a feature in XML Schema that would allow one to specify that the mixed content in certain elements is restricted by a pattern facet. This is a feature that isn't in XML Schema, but that has been discussed. This would make it possible to declare that a document does not allow C0 control characters, which would be very important in many cases if the basic XML syntax started to allow C0. Regards, Martin. At 10:32 01/07/19 -0600, Shigemichi Yazawa wrote: At Thu, 19 Jul 2001 15:52:39 +0900, Martin Duerst [EMAIL PROTECTED] wrote: Of course then pattern restrictions on mixed content (which we currently don't have) would become really helpful. Martin, What kind of pattern restrictions are necessary by introducing C0 NCR? Something like this? &#x1b;$B --- Shigemichi Yazawa [EMAIL PROTECTED]
RE: Is there Unicode mail out there?
At 01:11 PM 7/19/01 -0500, Mike Ayers wrote: The work has to be done somewhere. Emerging technologies must be compatible with existing ones, and some old technologies hang around a long time. Really, the disallowing of control characters makes sense, since their interpretation in so many existing protocols is to wreak havoc upon the unsuspecting. You simply can't send these characters around the internet and expect them to arrive unchanged. Does anyone have a (list, web site, reference) which lists which C0 and C1 control codes wreak havoc upon the unsuspecting and why? Bill Kurmey, Edmonton, AB, Canada
RE: Is there Unicode mail out there?
At Thu, 19 Jul 2001 13:11:35 -0500, Ayers, Mike [EMAIL PROTECTED] wrote: I'm proposing it as a convention, not a proprietary solution. I agree that a standard solution would be preferred, especially Martin's suggestion of permitting the escape codes but not the characters. I proposed the markup as a workaround until a better solution could be found. This sounds good. Can we submit a proposition to W3C? I believe that it helps many people. - Shigemichi Yazawa [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
Tex Texin scripsit: Which seemed to me to rule out the NCR for &gt; in situations other than ]]> for compatibility reasons. If they are needed elsewhere, they must be escaped using either numeric character references or the strings &amp; and &lt; respectively. The right angle bracket (>) may be represented using the string &gt;, and must, for compatibility, be escaped using &gt; or a character reference when it appears in the string ]]> in content, when that string is not marking the end of a CDATA section. Naah. Just because it says may doesn't mean anything: what may be done, also may be not done. You may use a numeric character reference for any legal character. -- John Cowan [EMAIL PROTECTED] One art/there is/no less/no more/All things/to do/with sparks/galore --Douglas Hofstadter
Re: Is there Unicode mail out there?
John, ok and thanks. I wasn't looking at the may though, I was looking at the must. Maybe I am not parsing this sentence right. To me it says: (must, for compatibility, be escaped using &gt; ) or (a character reference when it appears in the string ]]> in content, when that string is not marking the end of a CDATA section.) So it must not be an NCR, EXCEPT in the seemingly rare case where the string ]]> appears in content AND that string is not being used to indicate the end of a CDATA section. How is that supposed to be read? tex John Cowan wrote: Tex Texin scripsit: Which seemed to me to rule out the NCR for &gt; in situations other than ]]> for compatibility reasons. If they are needed elsewhere, they must be escaped using either numeric character references or the strings &amp; and &lt; respectively. The right angle bracket (>) may be represented using the string &gt;, and must, for compatibility, be escaped using &gt; or a character reference when it appears in the string ]]> in content, when that string is not marking the end of a CDATA section. Naah. Just because it says may doesn't mean anything: what may be done, also may be not done. You may use a numeric character reference for any legal character. -- John Cowan [EMAIL PROTECTED] One art/there is/no less/no more/All things/to do/with sparks/galore --Douglas Hofstadter -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
RE: Is there Unicode mail out there?
From: Tex Texin [mailto:[EMAIL PROTECTED]] So it must not be an NCR, EXCEPT in the seemingly rare case where the string ]]> appears in content AND that string is not being used to indicate the end of a CDATA section. How is that supposed to be read? Simple. Since ]]> is used to mark the end of a CDATA section, and since CDATA can contain anything, if you want to put the sequence ]]> INSIDE your CDATA, then you must escape the >, or else it will END your CDATA. In other words, CDATA can contain anything except literal ]]>. Think */ and C/C++... HTH, /|/|ike
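The point that a CDATA section cannot contain a literal ]]> can be illustrated with a short Python sketch. Since there is no escape syntax inside a CDATA section itself, one widely used workaround is to close the section after the two brackets and reopen a new one before the angle bracket. The function name wrap_in_cdata is made up for this illustration and does not come from the thread.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any embedded ']]>' across two sections."""
    # "]]>" always ends a CDATA section, so end the current section right
    # after "]]" and start a fresh one that begins with ">".
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# Usage: the parser joins the adjacent sections back into the original text.
document = "<r>" + wrap_in_cdata("a ]]> b") + "</r>"
recovered = ET.fromstring(document).text
```

Outside a CDATA section the same sequence is simply written with &gt; or a numeric character reference, as the quoted spec text says.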
Re: Is there Unicode mail out there?
The quotation I have is from my college Greek textbook (sadly my fluency has reduced to essentially zero after all these years). Perhaps some Greeks on the list could say which is the more accurate formulation? Mark — πάντων μέτρον ἄνθρωπος — Πρωταγόρας [http://www.macchiato.com] - Original Message - From: Otto Stolz [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED] Cc: unicode [EMAIL PROTECTED] Sent: Friday, July 20, 2001 09:18 Subject: Re: Is there Unicode mail out there? Mark Davis wrote: πάντων μέτρον ἄνθρωπος — Πρωταγόρας You mean “πάντων χρημάτων μέτρον ἄνθρωπος”, dont you? ;-) Best wishes, Otto Stolz
Re: Is there Unicode mail out there?
At Thu, 19 Jul 2001 15:52:39 +0900, Martin Duerst [EMAIL PROTECTED] wrote: Of course then pattern restrictions on mixed content (which we currently don't have) would become really helpful. Martin, What kind of pattern restrictions are necessary by introducing C0 NCR? Something like this? &#x1b;$B --- Shigemichi Yazawa [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
I agree with the overall sentiment here, but here's one nit Or you are so lazy that you want to put it [your data] in CDATA section without checking it at all. CDATA sections have a severe problem, which is that there is no way to escape otherwise legal XML characters that can't be represented in the chosen document encoding. The best bet is to avoid CDATA sections altogether. Andy Heninger IBM, Cupertino, CA [EMAIL PROTECTED] - Original Message - From: Shigemichi Yazawa [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, July 19, 2001 12:03 AM Subject: RE: Is there Unicode mail out there? At Wed, 18 Jul 2001 14:21:35 -0500, Ayers, Mike [EMAIL PROTECTED] wrote: So why not used tagged data to represent C0 and C1 characters? That is what XML is made of. As far as why control characters are not permitted, it seems to ma that this is so that XML documents can be passed around easily, through HTTP, email, FTP and so on, without loss of data. Protocols abound which interpret control characters, so XML files which contain data may get mangled or may mangle the systems which pass them. However, if that data is included as tagged hex digits, no problem will occur either way. XML states Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. But, in my opinion, XML has outgrown its original goal way too far. XML seems to be used in every aspect of software engineering these days. Tagging disallowed characters is one way to work around the problem. But I don't buy this solution for two reasons. 1. Markup is for describing a document's structure. 1 Introduction says Markup encodes a description of the document's storage layout and logical structure. You could do something like charEscape codepoint=000c /. This doesn't express any structure of the document, though. Using a markup merely to escape a character is too hacky, in my opinion. 2. This is a proprietary solution. 
To get the original character, the application needs to know the semantics of the markup and needs to know how to decode the data appropriately. If it's the standard encoding like NCR, that's fine because everybody knows how to deal with it. But the tagging is specific to a DTD. It makes it difficult to interchange the data. This character restriction in XML makes XML document creation difficult. Say you have some data you want to wrap in XML. You don't know much about the content of the data. What you know about it is its character encoding and that it is textual data. That's fine because you just want to wrap it in XML. You would check if it contains < or & and convert them to entity references. Or you are so lazy that you want to put it in CDATA section without checking it at all. The problem is that it might contain C0 control codes, which are legal characters for most of the encodings. Unless you are absolutely sure that the data doesn't contain any control codes, you have to check every character to make sure that you don't produce an ill-formed XML document. Even if you find a control, there isn't a standard way to treat it. You end up deleting it or escaping it in a proprietary way. - Shigemichi Yazawa [EMAIL PROTECTED]
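The character-by-character check described above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function name escape_for_xml is hypothetical, it looks only at C0 controls (not surrogates or U+FFFE/U+FFFF), and it raises an error on a forbidden control because, as the message says, there is no standard way to treat one.

```python
def escape_for_xml(text: str) -> str:
    """Escape markup characters and reject C0 controls that XML 1.0 forbids."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif cp < 0x20 and ch not in "\t\n\r":
            # No standard escape exists for these in XML 1.0, so the caller
            # has to decide: delete, replace, or use a private convention.
            raise ValueError(f"U+{cp:04X} is not allowed in XML 1.0")
        else:
            out.append(ch)
    return "".join(out)


# Usage: ordinary text passes through with markup characters escaped.
escaped = escape_for_xml("a<b&c")
```

Raising an exception rather than silently dropping the character keeps the data-loss decision visible to the caller.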
RE: Is there Unicode mail out there?
From: John Cowan [mailto:[EMAIL PROTECTED]] I think that any proposal to shrink the range of well-formed documents is simply a nonstarter, regrettable as that is. I had thought that one of the main goals of XML Blueberry was mainframe compatibility. If so, won't they need to disallow the C1 characters which wreak havoc on mainframe terminals? If they can make that change, other relatively minor changes could be made at that time (if ever). That's my thinking, anyway. Should I be crossposting the XML folks on this? /|/|ike
RE: Is there Unicode mail out there?
From: Shigemichi Yazawa [mailto:[EMAIL PROTECTED]] XML states Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. But, in my opinion, XML has outgrown its original goal way too far. XML seems to be used in every aspect of software engineering these days. True, but don't blame W3C for the digital hammer effect. Tagging disallowed characters is one way to work around the problem. But I don't buy this solution for two reasons. 1. Markup is for describing a document's structure. 1 Introduction says Markup encodes a description of the document's storage layout and logical structure. That's how it works in theory. In practice, however, pictures, applets, and many other non-structural components are encoded with markup. 2. This is a proprietary solution. To get the original character, the application needs to know the semantics of the markup and needs to know how to decode the data appropriately. If it's the standard encoding like NCR, that's fine because everybody knows how to deal with it. But the tagging is specific to a DTD. It makes it difficult to interchange the data. I'm proposing it as a convention, not a proprietary solution. I agree that a standard solution would be preferred, especially Martin's suggestion of permitting the escape codes but not the characters. I proposed the markup as a workaround until a better solution could be found. This character restriction in XML makes XML document creation difficult. The work has to be done somewhere. Emerging technologies must be compatible with existing ones, and some old technologies hang around a long time. Really, the disallowing of control characters makes sense, since their interpretation in so many existing protocols is to wreak havoc upon the unsuspecting. You simply can't send these characters around the internet and expect them to arrive unchanged. /|/|ike
Re: Is there Unicode mail out there?
Lars, I was looking at Section 2.4 Character Data and Markup: http://www.w3.org/TR/2000/REC-xml-20001006#syntax Which seemed to me to rule out the NCR for &gt; in situations other than ]]> for compatibility reasons. If they are needed elsewhere, they must be escaped using either numeric character references or the strings &amp; and &lt; respectively. The right angle bracket (>) may be represented using the string &gt;, and must, for compatibility, be escaped using &gt; or a character reference when it appears in the string ]]> in content, when that string is not marking the end of a CDATA section. tex Lars Marius Garshol wrote: * Tex Texin | | XML restricts the character set which by implication restricts the | NCR values. I see that &gt; can't use an NCR but &lt; can. They can both use NCRs. In fact, the example definitions of the predefined entities do just that: URL: http://www.w3.org/TR/REC-xml#sec-predefined-ent --Lars M. -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
Re: Is there Unicode mail out there?
At 14:30 01/07/17 -0700, Mark Davis wrote: In that case the content of the field is not text but an octet string, and you need to do something different, like base64-ing it. The content in the database is not an octet string: it is a text field that happens to have a control code -- a legitimate character code -- in it. Practically every database allows control codes in text fields. (And why are C1 controls allowed? After all, they are even less frequent than C0 controls.) Mark - I understand your dissatisfaction. But the C1 controls are not allowed in HTML4, and according to James Clark, the fact that they are allowed in XML was an oversight. Databases can (and should) keep care of their data. There are very few cases where having control characters in there makes sense. In the most cases, however, they are errors, and if XML gives an incentive to fix them, all the better. I wouldn't want any control codes in a database. Having a control-G may be funny (the joke as I know it goes back to Don Knuth), but something like a control-S is too much of a risk. Regards, Martin.
Re: Is there Unicode mail out there?
I wouldn't want any control codes in a database. Having a control-G may be funny (the joke as I know it goes back to Don Knuth), but something like a control-S is too much of a risk. *You* wouldn't want? There are a lot of characters *I* wish were not in databases, or in use at all. A lot of them may or may not make sense. Whether or not I want them, someone can have a database where they are allowed. By having this (inconsistent) restriction, it simply means I can't be guaranteed full round-tripping from databases to XML and back, no matter what their content. Of course, this is not a huge restriction -- it is simply a gratuitous annoyance. One could even live with something much more onerous, say XML disallowing all characters whose code points were divisible by 4321 -- just have complicated DTDs and shift into base64 if you encounter any of those codes. Mark — πάντων μέτρον ἄνθρωπος — Πρωταγόρας [http://www.macchiato.com] - Original Message - From: Martin Duerst [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED]; John Cowan [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; Lars Marius Garshol [EMAIL PROTECTED] Sent: Tuesday, July 17, 2001 18:36 Subject: Re: Is there Unicode mail out there? At 14:30 01/07/17 -0700, Mark Davis wrote: In that case the content of the field is not text but an octet string, and you need to do something different, like base64-ing it. The content in the database is not an octet string: it is a text field that happens to have a control code -- a legitimate character code -- in it. Practically every database allows control codes in text fields. (And why are C1 controls allowed? After all, they are even less frequent than C0 controls.) Mark - I understand your dissatisfaction. But the C1 controls are not allowed in HTML4, and according to James Clark, the fact that they are allowed in XML was an oversight. Databases can (and should) keep care of their data. There are very few cases where having control characters in there makes sense. 
In the most cases, however, they are errors, and if XML gives an incentive to fix them, all the better. I wouldn't want any control codes in a database. Having a control-G may be funny (the joke as I know it goes back to Don Knuth), but something like a control-S is too much of a risk. Regards, Martin.
Re: Is there Unicode mail out there?
* Michael Everson | | Perhaps I have been asleep, but is that notation (#X;) valid | HTML for all Unicode characters? The numeric character reference syntax is defined by SGML, and just referenced by HTML, and in SGML it is defined in terms of the document character set, which is defined by the SGML declaration used by each SGML application (of which HTML is one instance). The numeric character reference syntax can be used to refer to any character in the document character set (as declared by the SGML declaration used by HTML[1]). The document character set used by HTML is Unicode, but some characters have been disallowed, and may not appear in documents, whether directly or by reference. These are U+0000 - U+0009 U+000B - U+000C U+000E - U+0019 U+007F - U+009F U+D800 - U+DFFF --Lars M. [1] URL: http://www.w3.org/TR/html401/sgml/sgmldecl.html
Re: Is there Unicode mail out there?
* Tex Texin | | XML restricts the character set which by implication restricts the | NCR values. I see that &gt; can't use an NCR but &lt; can. They can both use NCRs. In fact, the example definitions of the predefined entities do just that: URL: http://www.w3.org/TR/REC-xml#sec-predefined-ent --Lars M.
Re: Is there Unicode mail out there?
* Mark Davis | | The HTML spec depends on the SGML spec for a characterization of | allowable characters. The latter, unfortunately, disallows some | valid Unicode characters (most C0 controls), but inconsistently | allows other similar characters (C1 controls). SGML is silent on the issue of what characters are allowed. It is the SGML declaration used by each application which decides this, and you can easily make an SGML declaration which allows every Unicode character. To wit: !SGML ISO 8879:1986 (WWW) CHARSET BASESET ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6 DESCSET 0 55296 0 55296 2048UNUSED -- SURROGATES -- 57344 1056768 57344 CAPACITYSGMLREF TOTALCAP15 GRPCAP 15 ENTCAP 15 SCOPEDOCUMENT SYNTAX SHUNCHAR NONE BASESET ISO 646IRV:1991//CHARSET International Reference Version (IRV)//ESC 2/8 4/2 DESCSET 0 128 0 FUNCTION RE13 RS10 SPACE 32 TAB SEPCHAR9 NAMING LCNMSTRT UCNMSTRT LCNMCHAR .-_: UCNMCHAR .-_: NAMECASE GENERAL YES ENTITY NO DELIMGENERAL SGMLREF HCRO #38;#x -- 38 is the number for ampersand -- SHORTREF SGMLREF NAMESSGMLREF QUANTITY SGMLREF ATTCNT 60 -- increased -- ATTSPLEN 65536 -- These are the largest values -- LITLEN 65536 -- permitted in the declaration -- NAMELEN 65536 -- Avoid fixed limits in actual -- PILEN65536 -- implementations of HTML UA's -- TAGLVL 100 TAGLEN 65536 GRPGTCNT 150 GRPCNT 64 FEATURES MINIMIZE DATATAG NO OMITTAG YES RANK NO SHORTTAG YES LINK SIMPLE NO IMPLICIT NO EXPLICIT NO OTHER CONCUR NO SUBDOC NO FORMAL YES APPINFO NONE | That means that it is not possible in HTML (or more importantly, in | XML) to represent all valid Unicode characters in data fields. What would you want to use control characters for in an XML document? --Lars M.
Re: Is there Unicode mail out there?
I had been told by the W3C people that the reason for forbidding control characters in XML and HTML was for compatibility with SGML. I've never checked it, since unfortunately the SGML standard is not online. If not true, that's very interesting. When you are thinking of XML as a general transmission mechanism for data (not just a text document) it becomes clear. Suppose that you have a database, of any sort. Some fields may or may not contain control characters -- since control characters are perfectly legal in many if not all databases. You want to query that database and get a selection, packaged as XML. Unfortunately, you have to invent your own home-brew quoting mechanism for the control characters, since the standard XML does not permit you to represent all of the -- perfectly valid -- characters in that database. And such a home-brew mechanism will not interwork with anything else. Conversely, you could filter out the control characters. That, of course, would corrupt the data. Generally considered a bad thing. Mark — πάντων μέτρον ἄνθρωπος — Πρωταγόρας [http://www.macchiato.com] - Original Message - From: Lars Marius Garshol [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, July 17, 2001 02:28 Subject: Re: Is there Unicode mail out there? * Mark Davis | | The HTML spec depends on the SGML spec for a characterization of | allowable characters. The latter, unfortunately, disallows some | valid Unicode characters (most C0 controls), but inconsistently | allows other similar characters (C1 controls). SGML is silent on the issue of what characters are allowed. It is the SGML declaration used by each application which decides this, and you can easily make an SGML declaration which allows every Unicode character. 
To wit: !SGML ISO 8879:1986 (WWW) CHARSET BASESET ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6 DESCSET 0 55296 0 55296 2048UNUSED -- SURROGATES -- 57344 1056768 57344 CAPACITYSGMLREF TOTALCAP15 GRPCAP 15 ENTCAP 15 SCOPEDOCUMENT SYNTAX SHUNCHAR NONE BASESET ISO 646IRV:1991//CHARSET International Reference Version (IRV)//ESC 2/8 4/2 DESCSET 0 128 0 FUNCTION RE13 RS10 SPACE 32 TAB SEPCHAR9 NAMING LCNMSTRT UCNMSTRT LCNMCHAR .-_: UCNMCHAR .-_: NAMECASE GENERAL YES ENTITY NO DELIMGENERAL SGMLREF HCRO #38;#x -- 38 is the number for ampersand -- SHORTREF SGMLREF NAMESSGMLREF QUANTITY SGMLREF ATTCNT 60 -- increased -- ATTSPLEN 65536 -- These are the largest values -- LITLEN 65536 -- permitted in the declaration -- NAMELEN 65536 -- Avoid fixed limits in actual -- PILEN65536 -- implementations of HTML UA's -- TAGLVL 100 TAGLEN 65536 GRPGTCNT 150 GRPCNT 64 FEATURES MINIMIZE DATATAG NO OMITTAG YES RANK NO SHORTTAG YES LINK SIMPLE NO IMPLICIT NO EXPLICIT NO OTHER CONCUR NO SUBDOC NO FORMAL YES APPINFO NONE | That means that it is not possible in HTML (or more importantly, in | XML) to represent all valid Unicode characters in data fields. What would you want to use control characters for in an XML document? --Lars M.
Re: Is there Unicode mail out there?
In that case the content of the field is not text but an octet string, and you need to do something different, like base64-ing it. The content in the database is not an octet string: it is a text field that happens to have a control code -- a legitimate character code -- in it. Practically every database allows control codes in text fields. (And why are C1 controls allowed? After all, they are even less frequent than C0 controls.) Your task is to design an XML DTD to represent a selection from a database. The database is nothing fancy: Latin-1 encoded. It is conceivable that a control character is in one of the hundreds of thousands of records. Not likely, but conceivable. You must guarantee no loss of data in the XML representation of the data. If XML could represent all control characters, then an instance of a selection in XML might be as simple as the following. record firstnameJohn/firstname lastnameSmith/lastname birthdate1950-10-10/birthdate ... /record The DTD would also be simple. Now, change the DTD (*and* the program that interprets it) so that each and every text field could be a base64 instead. Very ugly. You don't want to simply change all the fields to base64, since that would (a) bulk them up and (b) make them unreadable for debugging. So you end up having each field have two alternate representations. And in your parser you have to be prepared for either, and in your generator you have to pick between them. Notice that for *any* database that allows control codes, to avoid data corruption you would have to do such ugliness for any XML representation. Of course, nobody does it, which means that there is always the opportunity for data corruption. Of course, one might just not care -- after all, it would be rare that this would cause a problem. 
Mark — πάντων μέτρον ἄνθρωπος — Πρωταγόρας [http://www.macchiato.com] - Original Message - From: John Cowan [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; Lars Marius Garshol [EMAIL PROTECTED]; Martin Duerst [EMAIL PROTECTED] Sent: Tuesday, July 17, 2001 11:10 Subject: Re: Is there Unicode mail out there? Mark Davis wrote: I had been told by the W3C people that the reason for forbidding control characters in XML and HTML was for compatibility with SGML. More accurately, with the SGML default syntax, which is used in HTML and (with a few modifications) in XML. When you are thinking of XML as a general transmission mechanism for data (not just a text document) it becomes clear. Suppose that you have a database, of any sort. Some fields may or may not contain control characters -- since control characters are perfectly legal in many if not all databases. You want to query that database and get a selection, packaged as XML. In that case the content of the field is not text but an octet string, and you need to do something different, like base64-ing it. -- There is / one art || John Cowan [EMAIL PROTECTED] no more / no less || http://www.reutershealth.com to do / all things || http://www.ccil.org/~cowan with art- / lessness \\ -- Piet Hein
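The "two alternate representations" burden described in the message above can be sketched in Python. This is a hedged illustration, not anything proposed in the thread: the encoding="base64" attribute and the function names are hypothetical conventions, and the field values are assumed to be Latin-1 text as in the example database.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def needs_base64(value: str) -> bool:
    """True if the value holds a C0 control that XML 1.0 cannot carry."""
    return any(ord(ch) < 0x20 and ch not in "\t\n\r" for ch in value)


def encode_field(name: str, value: str) -> str:
    """Emit plain text when possible, base64 only when a control forces it."""
    if needs_base64(value):
        payload = base64.b64encode(value.encode("latin-1")).decode("ascii")
        # 'encoding' is a made-up attribute for this sketch.
        return f'<{name} encoding="base64">{payload}</{name}>'
    return f"<{name}>{escape(value)}</{name}>"


def decode_field(element_xml: str) -> str:
    """Reader side: must be prepared for either representation."""
    elem = ET.fromstring(element_xml)
    text = elem.text or ""
    if elem.get("encoding") == "base64":
        return base64.b64decode(text).decode("latin-1")
    return text


# Usage: a clean value stays readable, a value with control-G is base64-ed.
plain = encode_field("lastname", "Smith")
odd = encode_field("lastname", "Sm\x07ith")
```

Both the generator and the parser carry the extra branch, which is the ugliness the message complains about.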
Re: Is there Unicode mail out there?
In a message dated 2001-07-17 2:24:44 Pacific Daylight Time, [EMAIL PROTECTED] writes: The document character set used by HTML is Unicode, but some characters have been disallowed, and may not appear in documents, whether directly or by reference. These are U+0000 - U+0009 U+000B - U+000C U+000E - U+0019 U+007F - U+009F U+D800 - U+DFFF This list, and others like it, needs to be updated to include the non-characters (0xFDD0 through 0xFDEF, plus all code points whose low-order 16 bits are 0xFFFE or 0xFFFF). I was just looking through the XML spec today, and the only non-characters excluded (other than the surrogates) are 0xFFFE and 0xFFFF. -Doug Ewell Fullerton, California
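The non-character rule given in the message above (U+FDD0 through U+FDEF, plus every code point whose low-order 16 bits are FFFE or FFFF) is easy to state as code. A minimal Python sketch, with is_noncharacter as an illustrative name:

```python
def is_noncharacter(cp: int) -> bool:
    """True for Unicode noncharacters in the range U+0000..U+10FFFF."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: xxFFFE and xxFFFF.
    return (cp & 0xFFFF) in (0xFFFE, 0xFFFF)


# Usage: count them across all 17 planes (32 in the FDD0 block + 2 per plane).
total = sum(is_noncharacter(cp) for cp in range(0x110000))
```

Of these, XML 1.0 excludes only U+FFFE and U+FFFF, which is the gap the message points out.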
Re: Is there Unicode mail out there?
[EMAIL PROTECTED] scripsit: I was just looking through the XML spec today, and the only non-characters excluded (other than the surrogates) are 0xFFFE and 0xFFFF. Unfortunately, there's nothing we can do about it now, nor about the useless C1 controls other than NEL. Shrinking the range of well-formed documents is an immediate loser, even if there is no plausible use for such documents. Just pretend you'll never get one of the legal non-characters. -- John Cowan [EMAIL PROTECTED] One art/there is/no less/no more/All things/to do/with sparks/galore --Douglas Hofstadter
Re: Is there Unicode mail out there?
At Sat, 14 Jul 2001 09:49:30 -0700, Mark Davis [EMAIL PROTECTED] wrote: No, but it is for the vast majority. Some have to be written specially, e.g. &lt; I looked at XML 1.0 spec and it says in 2.4 Character Data and Markup that If they are needed elsewhere, they must be escaped using either numeric character references or the strings &amp; and &lt; respectively. I also looked at HTML 4.01 spec and it doesn't say in 5.3.2 Character entity references that &#60; cannot be used to represent <. Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) This is true for XML, but I couldn't find any statement in HTML 4.01 spec to restrict the use of U+0007 in HTML document. By the way, I have been pondering why, in XML, all the C1 control characters are legal but some of the C0 control characters are not. 2.2 Characters says that Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. and the BNF for Char is this. [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ Does this mean C0 controls are not legal Unicode characters? --- Shigemichi Yazawa [EMAIL PROTECTED]
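The Char production quoted above translates directly into a range test. The following Python sketch follows XML 1.0 production [2] (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF); the function name is_xml_char is just an illustrative choice.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production [2]."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogates
        or 0xE000 <= cp <= 0xFFFD      # skips U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


# Usage: the asymmetry noted in the thread, U+0007 is out but U+0087 is in.
bell_allowed = is_xml_char(0x07)
c1_allowed = is_xml_char(0x87)
```

The production restricts what XML accepts; it does not say the excluded C0 controls are invalid in Unicode itself.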
Re: Is there Unicode mail out there?
The HTML spec depends on the SGML spec for a characterization of allowable characters. The latter, unfortunately, disallows some valid Unicode characters (most C0 controls), but inconsistently allows other similar characters (C1 controls). That means that it is not possible in HTML (or more importantly, in XML) to represent all valid Unicode characters in data fields. Mark — πάντων μέτρον ἄνθρωπος — Πρωταγόρας [http://www.macchiato.com] - Original Message - From: Shigemichi Yazawa [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Monday, July 16, 2001 12:12 Subject: Re: Is there Unicode mail out there? At Sat, 14 Jul 2001 09:49:30 -0700, Mark Davis [EMAIL PROTECTED] wrote: No, but it is for the vast majority. Some have to be written specially, e.g. lt; I looked at XML 1.0 spec and it says in 2.4 Character Data and Markup that If they are needed elsewhere, they must be escaped using either numeric character references or the strings amp; and lt; respectively. I also looked at HTML 4.01 spec and it doesn't say in 5.3.2 Character entity references that #60; cannot be used to represent . Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) This is true for XML, but I couldn't find any statement in HTML 4.01 spec to restrict the use of U+0007 in HTML document. By the way, I have been pondering why, in XML, all the C1 control characters are legal but some of the C0 control characters are not. 2.2 Characters says that Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. and the BNF for Char is this. [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |/* any Unicode character, [#xE000-#xFFFD] | [#x1-#x10] excluding the surrogate blocks, FFFE, and . */ Does this mean C0 controls are not legal Unicode characters? --- Shigemichi Yazawa [EMAIL PROTECTED]
RE: Is there Unicode mail out there?
Gaute B Strokkenes wrote: ... That's the only benefit that Unicode and UTF-8 will bring to email: the ability to mix and match characters from all scripts of all sizes and shapes in a single message. OTOH, for those of us who need this it's a big advantage. There are also a number of scripts which don't have any registered encoding or code-page except Unicode / ISO-10646 - for users of those scripts, whether or not they want to mix characters from other scripts, Unicode / UTF-8 is the only real choice (unless they want to use some non-standard font based encoding). However, since many of these scripts are also complex scripts, clients need to be able to render them properly to be of much use with these scripts. - Chris
RE: Is there Unicode mail out there?
Mark Davies wrote: Take a look at the XML standard. Mark The thread was discussing HTML. Are there any restrictions on numeric character references in the *HTML* standard? - Chris
Re: Is there Unicode mail out there?
Mark, ok thanks. XML restricts the character set which by implication restricts the NCR values. I see that &gt; can't use an NCR but &lt; can. tex Mark Davis wrote: Take a look at the XML standard. Mark - Original Message - From: Tex Texin [EMAIL PROTECTED] Hi. I am not sure why you say this. &lt; is often used for < but &#X003C; works in both IE 5 and Netscape 4.7. &#X0007; shows a box though... But I was not aware of any restrictions on numeric character references. Is there a list of restrictions somewhere? tex Mark Davis wrote: No, but it is for the vast majority. Some have to be written specially, e.g. &lt; Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
Re: Is there Unicode mail out there?
yes - Original Message - From: Christopher J Fynn [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: Mark Davis [EMAIL PROTECTED] Sent: Saturday, July 14, 2001 22:57 Subject: RE: Is there Unicode mail out there? Mark Davis wrote: Take a look at the XML standard. Mark The thread was discussing HTML. Are there any restrictions on numeric character references in the *HTML* standard? - Chris
Re: Is there Unicode mail out there?
From: Gaute B Strokkenes [EMAIL PROTECTED] No way. Any mail client that is sufficiently clever to understand UTF-8 should understand all valid and registered MIME-charsets. After all, conversion libraries are both widely available and easy to use. Do you know of any that actually do? How about just supporting these: ISO646-PT, ISO10646-UTF-1, NATS-SEFI and HP-DeskTop? All the `all messages should be in UTF-8, even when there are well-established legacy encodings that cover the characters of a given message' mumbo-jumbo that has been mentioned recently on the list is really just so much hot air. I don't think anyone was suggesting that for all lists. However, here, on the Unicode list, everyone on the list should be able to handle Unicode, and those who can have sometimes been willing to cut and paste into a Unicode editor just to see what's up. Legacy encodings should be used when you're communicating with people who use legacy encodings and legacy mail readers. Unicode people don't - after ASCII, UTF-8 is probably the closest thing we have to a common usable encoding. -- David Starner - [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote: From: Gaute B Strokkenes [EMAIL PROTECTED] No way. Any mail client that is sufficiently clever to understand UTF-8 should understand all valid and registered MIME-charsets. After all, conversion libraries are both widely available and easy to use. Do you know of any that actually do? Actually do convert messages in arbitrary charsets to UTF-8 / Unicode, you mean? Any reasonably modern mail client will. IIRC Microsoft OE and friends do everything in Unicode internally and only convert to other encodings when receiving or sending mail. (Though OE is broken in so many other ways that I wouldn't recommend it.) Gnus/Emacs does too (actually it uses the Emacs MULE encoding internally, but from the users perspective the effect is precisely the same). My argument is based on the fact that if you have put in the necessary work to interpret UTF-8 messages, then it does not take at all that much extra effort to interpret messages in other charsets by running them through a converter first. I postulate that libraries to perform this function are both widely available and highly portable; if you do not agree then I would be happy to point out concrete examples. How about just supporting these: ISO646-PT, ISO10646-UTF-1, NATS-SEFI and HP-DeskTop? I'm not sure what you're trying to say here. Assuming these are properly registered charsets, it seems like a very narrow range to support. If they're not, then they have no place in email whatsoever (and UTF-8 is clearly a better choice.) I don't think anyone was suggesting that for all lists. However, here, on the Unicode list, everyone on the list should be able to handle Unicode, and those who can have sometimes been willing to cut and paste into a Unicode editor just to see what's up. I don't think that holds. People on the unicode list are not necessarily Unicode boffins, although a lot of the active people are. 
Some of us are just here because we have an interest in, say, i18n in general and like to keep an eye on things. If we all had to upgrade our software to do so, I think a lot of people just wouldn't bother. That way, everyone loses. Note that I think it is appropriate to use UTF-8 when there's just no common charset that can represent a given message. Legacy encodings should be used when you're communicating with people who use legacy encodings and legacy mail readers. Unicode people don't - after ASCII, UTF-8 is probably the closest thing we have to a common usable encoding. It's the closest thing that we have to a common _universal_ charset. For messages that do not require the `universal' property, there are many charsets that are just as sensible and, more to the point, much better supported. -- Gaute Strokkeneshttp://www.srcf.ucam.org/~gs234/ Yow! Am I in Milwaukee?
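The "run it through a converter first" argument can be illustrated with Python's built-in codec registry. This is only a sketch under stated assumptions: to_unicode is a hypothetical helper, the fallback policy for unknown labels is an arbitrary choice, and which charsets are available depends on the codec library in use (Python's, for instance, does not ship NATS-SEFI).

```python
import codecs


def to_unicode(raw: bytes, declared_charset: str) -> str:
    """Decode a message body using its declared MIME charset label."""
    try:
        codecs.lookup(declared_charset)
    except LookupError:
        # Unknown label: keep the ASCII part and mark the rest visibly
        # with U+FFFD instead of guessing.
        return raw.decode("ascii", errors="replace")
    return raw.decode(declared_charset, errors="replace")


if __name__ == "__main__":
    print(to_unicode(b"caf\xe9", "iso-8859-1"))
    print(to_unicode("\u0416".encode("koi8-r"), "koi8-r"))
    print(to_unicode(b"abc", "NATS-SEFI"))
```

Once the body is Unicode internally, re-encoding it as UTF-8 is a single further call, which is the sense in which supporting extra charsets is cheap for any that the converter already knows.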
Re: Is there Unicode mail out there?
At 11:07 -0400 2001-07-13, Tex Texin wrote: Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. Perhaps I have been asleep, but is that notation (&#X;) valid HTML for all Unicode characters? -- Michael Everson
Re: Is there Unicode mail out there?
On Sat, Jul 14, 2001 at 01:10:15PM +0100, Michael Everson wrote: At 11:07 -0400 2001-07-13, Tex Texin wrote: Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. Perhaps I have been asleep, but is that notation (&#X;) valid HTML for all Unicode characters? Since HTML 4, yes: http://www.w3.org/TR/html4/charset.html#h-5.3.1 -- Daniel Biddle [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
No, but it is for the vast majority. Some have to be written specially, e.g. &lt; Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) Mark - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, July 14, 2001 05:10 Subject: Re: Is there Unicode mail out there? At 11:07 -0400 2001-07-13, Tex Texin wrote: Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. Perhaps I have been asleep, but is that notation (&#X;) valid HTML for all Unicode characters? -- Michael Everson
Re: Is there Unicode mail out there?
At 09:49 -0700 2001-07-14, Mark Davis wrote: Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. Perhaps I have been asleep, but is that notation (&#X;) valid HTML for all Unicode characters? No, but it is for the vast majority. Some have to be written specially, e.g. &lt; Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) Then it's not standard and can't be relied upon. Pity. -- Michael Everson
Re: Is there Unicode mail out there?
michka the only book on internationalization in VB at http://www.i18nWithVB.com/ - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, July 14, 2001 9:56 AM Subject: Re: Is there Unicode mail out there? At 09:49 -0700 2001-07-14, Mark Davis wrote: Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. Perhaps I have been asleep, but is that notation (&#X;) valid HTML for all Unicode characters? No, but it is for the vast majority. Some have to be written specially, e.g. &lt; Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) Then it's not standard and can't be relied upon. Pity. -- Michael Everson
Re: Is there Unicode mail out there?
From: Michael Everson [EMAIL PROTECTED] Then it's not standard and can't be relied upon. Pity. Actually, it is a standard, as of HTML 4.0. All you need is a compliant browser. MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: Is there Unicode mail out there?
At 12:03 2001-07-13 EDT, [EMAIL PROTECTED] wrote: Unfortunately, the Windows world has no concept of a Last Resort font. It would certainly seem to be a useful solution in cases like this. Does a PostScript, Type 1, version of such a font exist for download somewhere? Adam --- http://phonecowboy.com/registrar/twist/ finds a good domain for you and checks for its existence.
Re: Is there Unicode mail out there?
From: Gaute B Strokkenes [EMAIL PROTECTED] On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote: From: Gaute B Strokkenes [EMAIL PROTECTED] No way. Any mail client that is sufficiently clever to understand UTF-8 should understand all valid and registered MIME-charsets. After all, conversion libraries are both widely available and easy to use. Do you know of any that actually do? Actually do convert messages in arbitrary charsets to UTF-8 / Unicode, you mean? No, I mean understand all valid and registered MIME-charsets. How about just supporting these: ISO646-PT, ISO10646-UTF-1, NATS-SEFI and HP-DeskTop? I'm not sure what you're trying to say here. Assuming these are properly registered charsets, it seems like a very narrow range to support. Maybe supporting at least these would have been a better phrasing. They're all valid and registered MIME-charsets. Do you know of a single mailer that supports all 4? If we all had to upgrade our software to do so, I think a lot of people just wouldn't bother. You're claiming on one hand that everyone's mailer should handle all sorts of charsets, and on the other using one that doesn't support the only charset that is RFC-mandated for a working mail program to support. (Yes, a mailer that doesn't handle UTF-8 violates the appropriate RFCs.) It's the closest thing that we have to a common _universal_ charset. You sure? Besides ASCII, what other charset can almost everyone read (including the people who cut and paste into Unicode editors, because they can read it)? There's no other charset (besides ASCII) that everyone with a working mailer, no matter how minimal, can read. -- David Starner - [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote: How about just supporting these: ISO646-PT, ISO10646-UTF-1, NATS-SEFI and HP-DeskTop? I'm not sure what you're trying to say here. Assuming these are properly registered charsets, it seems like a very narrow range to support. Maybe supporting at least these would have been a better phrasing. They're all valid and registered MIME-charsets. Do you know of a single mailer that supports all 4? OK, I get your point. There are a lot of obscure charsets out there, and it's probably not necessary to make sure that mail clients understand all of them since a lot of these have no precedent for use in email. Nevertheless, there are a number of charsets--ISO-8859-1, ISO-8859-2, KOI8-R, Shift_JIS and so on--that have widespread precedent for use in email, and are de-facto standards for email in certain languages. It would be extremely foolish to implement a mail client that understands UTF-8 but not these. If we all had to upgrade our software to do so, I think a lot of people just wouldn't bother. You're claiming on one hand that everyone's mailer should handle all sorts of charsets, and on the other using one that doesn't support the only charset that is RFC-mandated for a working mail program to support. I'm sorry, but you're mixing things up a bit. Keep in mind that in general there is a difference between what processes implementing Internet protocols should generate and what they are required to accept. One of the principles that the Internet is founded on is to be liberal in what you accept, and conservative in what you produce. (Yes, a mailer that doesn't handle UTF-8 violates the appropriate RFCs.) Chapter and verse, please? The only document I could find that puts forth such a requirement is the one at: http://www.imc.org/mail-i18n.html which is not a RFC. 
Other than that, there is RFC 2277; however this only states that protocols must make it possible to exchange textual data using UTF-8; it doesn't make it mandatory to understand UTF-8. RFC 2049 only states that US-ASCII must be understood, and the same for the ISO-8859-X charsets, except that you're not required to be able to display the non-ASCII characters they contain. There's no mention of UTF-8. If you have any better references, please provide them. (I do not claim to have encyclopedic knowledge of the subject.) Note that the IMC document does not encourage mail clients to produce UTF-8 by default, it only states that mail clients should be able to interpret it and give users the option to create messages in UTF-8. It explicitly recognises that few mail clients implemented good UTF-8 support at the time. That was three years ago, and little has changed since. It is only very recently that good UTF-8 support has become standard for new clients, and there are still lots and lots of old clients that have no UTF-8 support at all. It is certainly clear that the time scale hinted at in the document (that all mail clients created or revised after 1 January 1999 should be able to interpret UTF-8) was hopelessly optimistic. We're not there yet, even though we're getting closer. It's the closest thing that we have to a common _universal_ charset. You sure? Besides ASCII, what other charset can almost everyone read (including the people who cut and paste into Unicode editors, because they can read it)? There's no other charset (besides ASCII) that everyone with a working mailer, no matter how minimal, can read. Well, I'm saying that UTF-8 / Unicode is the closest thing that we have to a universal charset. (I meant universal as in universal character repertoire, not universally supported.) There are many charsets that are better supported in general than UTF-8; ASCII and ISO-8859-1 are two of them. 
However, the problem in question is not to choose the best charset in general, but to choose the best possible charset for a given message containing a given set of characters. RFC 2046 states: More generally, if a widely-used character set is a subset of another character set, and a body contains only characters in the widely-used subset, it should be labelled as being in that subset. This will increase the chances that the recipient will be able to view the resulting entity correctly. I think this is good advice. Consider the scenario where a group of people are accustomed to exchanging email in the language of their choice in a particular charset with little difficulty. Then some members of the group upgrade their software, and the other members of the group can then no longer read their messages, since the new software insists on using UTF-8 (which the older software does not support). That's bad, and the above advice avoids this situation. -- Gaute Strokkeneshttp://www.srcf.ucam.org/~gs234/ I'm thinking about DIGITAL READ-OUT systems and computer-generated IMAGE FORMATIONS..
Re: Is there Unicode mail out there?
Mark, Hi. I am not sure why you say this. &lt; is often used for < but &#X003C; works in both IE 5 and Netscape 4.7. &#X0007; shows a box though... But I was not aware of any restrictions on numeric character references. Is there a list of restrictions somewhere? tex Mark Davis wrote: No, but it is for the vast majority. Some have to be written specially, e.g. &lt; Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) Mark - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, July 14, 2001 05:10 Subject: Re: Is there Unicode mail out there? At 11:07 -0400 2001-07-13, Tex Texin wrote: Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. Perhaps I have been asleep, but is that notation (&#X;) valid HTML for all Unicode characters? -- Michael Everson -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
Re: Is there Unicode mail out there?
Take a look at the XML standard. Mark - Original Message - From: Tex Texin [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; Michael Everson [EMAIL PROTECTED] Sent: Saturday, July 14, 2001 21:15 Subject: Re: Is there Unicode mail out there? Mark, Hi. I am not sure why you say this. &lt; is often used for < but &#X003C; works in both IE 5 and Netscape 4.7. &#X0007; shows a box though... But I was not aware of any restrictions on numeric character references. Is there a list of restrictions somewhere? tex Mark Davis wrote: No, but it is for the vast majority. Some have to be written specially, e.g. &lt; Some cannot be written at all, e.g. U+0007 (but U+0087 can be!) Mark - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, July 14, 2001 05:10 Subject: Re: Is there Unicode mail out there? At 11:07 -0400 2001-07-13, Tex Texin wrote: Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. Perhaps I have been asleep, but is that notation (&#X;) valid HTML for all Unicode characters? -- Michael Everson -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
Re: Is there Unicode mail out there?
In a message dated 2001-07-12 8:55:07 Pacific Daylight Time, [EMAIL PROTECTED] writes: So the proposal is that minimizing the charset is a good thing? This means that you and I start out in a conversation about a product I am trying to sell you, it happens to be all in ascii and we exchange several mails successfully. Then I quote you a price in Euros and my 1252 message gets corrupted by your reader which can handle either only 8859-1 or ASCII, and you miss the fact that the Euro is corrupted and think we are talking dollars or some other currency. Although I understand why you would want a minimal charset in order to not needlessly prevent communications, the implication of reliability and trust that is built by having some success is a problem. You think you are communicating successfully but when it is critical it may not... The premise seems to be that we should reject, or at least issue a warning against, the earlier messages on the basis that the sender *might* be able to send characters in the future that the receiver could not receive. Sorry, but I can't buy into that. That would prevent the CP1252 user from ever being able to communicate adequately with anyone who has only ISO 8859-1. What if I am trying to exchange mail with a user of Windows-1256? Lots of roadblocks would be erected because of the chance that the guy *might* send me ARABIC LETTER ALEF WITH HAMZA BELOW and I couldn't interpret it. And I couldn't exchange mail with UTF-8 users either, because of that YI SYLLABLE BBOP they might send me some day. Perhaps if a harder line was taken when characters are used that cannot be converted, this would make more sense. (ie give a very clear recognizable indication of corruption or conversion failures) That's reasonable. Simply replacing unknown characters with '?' doesn't work; the character is too easily overlooked. I would like to see mailers replace unsupported characters with a Unicode representation like [U+A068]. 
(That would certainly help with this spate of CJK characters that people are sending lately on the Unicode list!) I suspect that's too much Unicode awareness to ask of an otherwise Unicode-unaware product, though. -Doug Ewell Fullerton, California
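The "[U+A068]" replacement suggested in this message can be prototyped with a custom encoding error handler. A minimal sketch in Python, assuming the target charset can at least carry ASCII; the handler name "uplus" and the helper downgrade are invented here for illustration:

```python
import codecs


def uplus_replace(err):
    """Error handler: replace unencodable characters with [U+XXXX]."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    marker = "".join(f"[U+{ord(ch):04X}]" for ch in bad)
    return marker, err.end  # resume encoding after the bad run


# "uplus" is an arbitrary name chosen for this sketch.
codecs.register_error("uplus", uplus_replace)


def downgrade(text: str, charset: str) -> str:
    """Show text as a reader limited to charset would see it."""
    return text.encode(charset, errors="uplus").decode(charset)


if __name__ == "__main__":
    print(downgrade("Price: \u20ac100", "iso-8859-1"))
    print(downgrade("\ua068", "ascii"))
```

Unlike a bare '?', the marker is hard to overlook and lets the reader look the character up, which addresses the Euro-versus-dollar scenario raised earlier in the thread.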
Re: Is there Unicode mail out there?
$B!!!z$8$e$&$$$C$A$c$s!z(B $B!!;d$O$m$3$($s$i$+$Y$5!#(B Am 2001-07-13 um 2:53 h EDT hat Doug Ewell geschrieben: Simply replacing unknown characters with '?' doesn't work; the character is too easily overlooked. I would like to see mailers replace unsupported characters with a Unicode representation like "[U+A068]". For "ordinary users", i. e., those users who don't have the TUS 3.0 tome lying next to their computers, a "last resort glyph" would probably be more helpful, cf. http://crl.nmsu.edu/~mleisher/lr.html and http://fonts.apple.com/LastResort/LastResort.html. Best wishes, Otto Stolz They can look it up online. Yes, it is a tome Not just a book, a TOME.
Re: Is there Unicode mail out there?
Doug, I thought I had acknowledged the rationale for supporting labeling the message with the minimal charset based on each message's contents in the beginning of the third paragraph, but maybe I should have expanded on it. Anyway, despite the benefit it is a significant problem that it is unreliable and that past performance does not predict future performance or whatever the phrase is that the financial markets use. I was mostly stage setting for the idea that there should be a clear indicator for a failed character conversion. The last resort proposal is ok. I agree with you about seeing the hex value for the missing character with the symbol. (I've already been forced to learn the Unicode codepoint for the Euro by heart... I would probably recognize most of the commonly failed characters if the code points were available.) Maybe writing the value as an HTML numeric character reference (e.g. &#X20AC;) would also make it easier for processes reading files saved by the mailer to recover the character. (By using a standard representation and also one that is not likely to appear in an email, unless the email is about character references...) For the Unicode-unaware the syntax could allow inclusion of the original code page label: &#X0080:windows1256; Anyway, this problem that characters that do not convert in mails are not being clearly indicated: occurs frequently, can have significant impact to users, seems to have some cheap workarounds, that are better than either just relabeling to the lowest common denominator or preventing communications entirely. tex [EMAIL PROTECTED] wrote: In a message dated 2001-07-12 8:55:07 Pacific Daylight Time, [EMAIL PROTECTED] writes: So the proposal is that minimizing the charset is a good thing? This means that you and I start out in a conversation about a product I am trying to sell you, it happens to be all in ascii and we exchange several mails successfully. 
Then I quote you a price in Euros and my 1252 message gets corrupted by your reader which can handle either only 8859-1 or ASCII, and you miss the fact that the Euro is corrupted and think we are talking dollars or some other currency. Although I understand why you would want a minimal charset in order to not needlessly prevent communications, the implication of reliability and trust that is built by having some success is a problem. You think you are communicating successfully but when it is critical it may not... The premise seems to be that we should reject, or at least issue a warning against, the earlier messages on the basis that the sender *might* be able to send characters in the future that the receiver could not receive. Sorry, but I can't buy into that. That would prevent the CP1252 user from ever being able to communicate adequately with anyone who has only ISO 8859-1. What if I am trying to exchange mail with a user of Windows-1256? Lots of roadblocks would be erected because of the chance that the guy *might* send me ARABIC LETTER ALEF WITH HAMZA BELOW and I couldn't interpret it. And I couldn't exchange mail with UTF-8 users either, because of that YI SYLLABLE BBOP they might send me some day. Perhaps if a harder line was taken when characters are used that cannot be converted, this would make more sense. (ie give a very clear recognizable indication of corruption or conversion failures) That's reasonable. Simply replacing unknown characters with '?' doesn't work; the character is too easily overlooked. I would like to see mailers replace unsupported characters with a Unicode representation like [U+A068]. (That would certainly help with this spate of CJK characters that people are sending lately on the Unicode list!) I suspect that's too much Unicode awareness to ask of an otherwise Unicode-unaware product, though. 
-Doug Ewell Fullerton, California -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
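The numeric-character-reference idea from this message maps onto an error handler that Python already ships. A rough sketch, assuming the standard "xmlcharrefreplace" handler and html.unescape; the two wrapper names are hypothetical, and note that Python emits decimal references where the message uses the hexadecimal form:

```python
import html


def to_ncr_bytes(text: str, charset: str) -> bytes:
    """Encode text, writing unencodable characters as &#NNNN; references."""
    return text.encode(charset, errors="xmlcharrefreplace")


def from_ncr_bytes(raw: bytes, charset: str) -> str:
    """Recover the characters from a body written by to_ncr_bytes."""
    return html.unescape(raw.decode(charset))


if __name__ == "__main__":
    wire = to_ncr_bytes("Price: \u20ac100", "iso-8859-1")
    print(wire)
    print(from_ncr_bytes(wire, "iso-8859-1"))
```

The caveat raised in the message applies to this sketch as well: html.unescape cannot tell a reference produced by the mailer from one the author typed on purpose, so a message about character references would be altered on recovery.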
Re: Is there Unicode mail out there?
In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, [EMAIL PROTECTED] writes: @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš @Ž„‚͂낱‚¦‚ñ‚ç‚©‚ׂ³B Robert, please stop this. It doesn't seem to be UTF-8 (that is, I can't copy and paste it into UniPad or Windows 2000 Notepad and see anything reasonable), and even if it were, neither I nor many other list members can read Japanese. We had this discussion earlier in the year about English vs. French, and other than exceptions like Patrick Andries' message (which was explicitly about a French translation), this is basically an English-language list. It is certainly cool to ask questions about this or that Japanese character, but simply posting an unreadable Japanese response to my English-language message makes no sense. -Doug Ewell Fullerton, California
Re: Is there Unicode mail out there?
In a message dated 2001-07-13 4:06:39 Pacific Daylight Time, [EMAIL PROTECTED] writes: For ordinary users, i. e., those users who don't have the TUS 3.0 tome lying next to their computers, a last resort glyph would probably be more helpful, cf. http://crl.nmsu.edu/~mleisher/lr.html and http://fonts.apple.com/LastResort/LastResort.html. Unfortunately, the Windows world has no concept of a Last Resort font. It would certainly seem to be a useful solution in cases like this. -Doug Ewell Fullerton, California
RE: Is there Unicode mail out there?
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, [EMAIL PROTECTED] writes: @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš @Ž„‚͂낱‚¦‚ñ‚ç‚©‚ׂ³B Robert, please stop this. It doesn't seem to be UTF-8 (that is, I can't copy and paste it into UniPad or Windows 2000 Notepad and see anything It's ISO-2022-JP, if that helps. character, but simply posting an unreadable Japanese response to my English-language message makes no sense. Ever think that maybe that's why he does it? Anyway, here's a hint. As someone who can read a little Japanese, I have never translated anything in one of 11DB's messages that really mattered. Anything that he wants us to see is put in English, so you can probably safely ignore the question marks. On the other hand... Yo, 11DB, get with the program! Use UTF-8: where do ya think ya are? The point is to confuse people, not frustrate them. ;-) /|/|ike
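The ISO-2022-JP diagnosis can be checked against the form of the signature that appears elsewhere in this thread as "$B$8$e$&$$$C$A$c$s(B". That string looks like ISO-2022-JP with the ESC bytes dropped somewhere in transit. A small sketch, assuming that the missing ESC bytes are the only damage; restore_iso2022jp is a made-up helper that handles just this one-segment case, not a general repair tool:

```python
ESC = b"\x1b"


def restore_iso2022jp(stripped: bytes) -> str:
    """Re-insert the ESC bytes around one JIS X 0208 segment and decode it."""
    if not (stripped.startswith(b"$B") and stripped.endswith(b"(B")):
        raise ValueError("expected a single '$B ... (B' segment")
    # ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
    raw = ESC + stripped[:-2] + ESC + b"(B"
    return raw.decode("iso2022_jp")


if __name__ == "__main__":
    print(restore_iso2022jp(b"$B$8$e$&$$$C$A$c$s(B"))
```

A mail reader that ignores the charset label, or a gateway that drops the escape bytes, leaves only the printable ASCII skeleton, which is why readers on the list saw such different garbage for the same signature.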
Re: Is there Unicode mail out there?
Doug Ewell wrote... @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš @Ž„‚͂낱‚¦‚ñ‚ç‚©‚ׂ³B Robert, please stop this. It doesn't seem to be UTF-8 (that is, I can't copy and paste it into UniPad or Windows 2000 Notepad and see anything reasonable) Eeek.. What's that? 11's comment shows up fine in my mail reader here, as Japanese chars. But what I got was, I believe, watashi wa rokoenrakabesa which isn't any Japanese that I can parse, and it should have a comma after wa in any case. Roko isn't a word, though rouko and roukou are (and don't make sense here). Besa isn't a verb ending, even in classical Japanese, and I can't imagine what it's supposed to mean. Enraka isn't a word, and koen isn't a word though kouen is... Hm. It's gibberish anyway, so it wouldn't matter if it came through. Just looks like nearly random syllables generated by someone who doesn't write the language. Rick
FW: Re: Is there Unicode mail out there?
Those are MOJIBAKE for my SIG. 1) I think that is mojibake for my name. It looks familiar. 2) The second one reads, if I rightly remember, "Watashi wa loco en la cabeza". If I get a mojibakus or two in a Chinese sig, I don't say anything. (Is mojibakus the singular of mojibake? Perhaps "mojibakum"?) $B$8$e$&$$$C$A$c$s(B --- Original Message --- $B:9=P?M(B: [EMAIL PROTECTED]; $B08@h(B: [EMAIL PROTECTED]; Cc: [EMAIL PROTECTED]; $BF|;~(B: 01/07/13 15:29 $B7oL>(B: Re: Is there Unicode mail out there? In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, [EMAIL PROTECTED] writes: $B%D!!%D!&!#c`TD%+c`TE!Wc`TD!"c`TD!Vc`TE"d?TD%=c`TE!#c`TE%"%D!&!#(B $B%D!!%J%9c`\d?TE:d?TE%)c`TD%"c`TD%rc`TE%"c`TE%!c`TD%%c`TENd?TD%&%D!#(B Robert, please stop this. It doesn't seem to be UTF-8 (that is, I can't copy and paste it into UniPad or Windows 2000 Notepad and see anything reasonable), and even if it were, neither I nor many other list members can read Japanese. We had this discussion earlier in the year about English vs. French, and other than exceptions like Patrick Andries' message (which was explicitly about a French translation), this is basically an English-language list. It is certainly cool to ask questions about this or that Japanese character, but simply posting an unreadable Japanese response to my English-language message makes no sense. -Doug Ewell Fullerton, California
Re: FW: Re: Is there Unicode mail out there?
Watashi wa loco en la cabeza Duh, well, use katakana as appropriate, use middle-dots between your foreign words, and people might get it. Rick
RE: Re: Is there Unicode mail out there?
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Those are MOJIBAKE for my SIG. Which is what you deserve for not sending UTF-8. Until you upgrade your mailer, your name will be @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš. :-p 1) I think that is mojibake for my name. It looks familiar. See above. 2) The second one reads, if I rightly remember, Watashi wa loco en la cabeza. Yep: 私はろこえんらかべさ。You're still hung up on this "use kana to represent any language" thing, huh? You've got that in common with the Japanese - I was quite surprised to find that most Japanese don't know that their katakana versions of English words don't sound much like English words. Anyway, if I ever meet a Spanish and Japanese fluent individual, I'll wave it under their nose to see if they catch it. They won't, though, since you're using hiragana instead of katakana. Here's some for you to transliterate: 1.) The bull's nose ring is where we attach the taurine towline. 2.) Raul studies lore. 3.) My file said he was vile. 4.) Fu did that. Fu who? Etc., etc., etc. If I get a mojibakus or two in a Chinese sig, I don't say anything. (Is mojibakus the singular of mojibake? Perhaps mojibakum?) You're the Japanese enthusiast - look it up! /|/|ike
Re: Is there Unicode mail out there?
Rick McGowan wrote: Eeek.. What's that? 11's comment shows up fine in my mail reader here, as Japanese chars. But what I got was, I believe, watashi wa rokoenrakabesa which isn't any Japanese that I can parse, and it should have a comma after wa in any case. Roko isn't a word, though rouko and roukou are (and don't make sense here). Besa isn't a verb ending, even in classical Japanese, and I can't imagine what it's supposed to mean. Enraka isn't a word, and koen isn't a word though kouen is... Hm. It's gibberish anyway, so it wouldn't matter if it came through. How's your Spanish, Rick? Try watashi wa as Japanese and roko en ra kabesa as Spanish... (keeping in mind that Japanese doesn't distinguish between r and l, of course.) Best regards, James Kass. ​
Re: Is there Unicode mail out there?
On Fri, 13 Jul 2001, [EMAIL PROTECTED] wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Those are MOJIBAKE for my SIG. Which is what you deserve for not sending UTF-8. Until you upgrade your mailer, your name will be ?@?š‚¶‚イ‚¢‚Á‚¿‚á‚ñ?š. :-p No way. Any mail client that is sufficiently clever to understand UTF-8 should understand all valid and registered MIME-charsets. After all, conversion libraries are both widely available and easy to use. [I can see you put a smiley after your statement so I realise you were probably being sarcastic, but I thought that this could bear pointing out.] All the `all messages should be in UTF-8, even when there are well-established legacy encodings that cover the characters of a given message' mumbo-jumbo that has been mentioned recently on the list is really just so much hot air. Firstly, mail clients will not be able to deprecate support for other charsets even if UTF-8 is widely adopted (which it isn't--for email) because of the need to be able to interpret the masses of existing messages. Secondly, maintaining such support is, as pointed out above, extremely easy to do. Thirdly, there are a great number of clients out there that do not support UTF-8 and are unlikely to do so in the immediate future, either because of internal limitations in the software that are hard to remove or because people don't upgrade. I think it's antisocial to say `Well, I _could_ have used a charset that would have enabled you to read my message but I decided not to, for no particularly good reason.' On the other hand it makes sense to say `Sorry, but UTF-8 is the only charset that will do since I wanted to use Etruscan, Russian and Japanese characters and UTF-8 is the only sane way to do this.' That's the only benefit that Unicode and UTF-8 will bring to email: the ability to mix and match characters from all scripts of all sizes and shapes in a single message. OTOH, for those of us who need this it's a big advantage. 
Another thing that some people may worry about is the bad interaction between quoted-unprintable and UTF-8 (or any non-West European / North American coding in general, but for UTF-8 it's even worse): 6 bytes for a single Cyrillic character? Ye gods. [I could start another rant about how bad an idea QP was in the first place, but that's off-topic here.] -- Gaute Strokkenes http://www.srcf.ucam.org/~gs234/ I am NOT a nut
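To put a number on the "6 bytes for a single Cyrillic character" complaint, here is a minimal Python sketch using the standard-library `quopri` module. The sample word and the helper name `qp_bytes_per_char` are assumptions chosen for illustration; they do not come from the thread.

```python
import quopri

# Sample Cyrillic word (an assumption for illustration, not from the thread).
text = "Привет"

def qp_bytes_per_char(s: str, charset: str) -> float:
    """Average number of quoted-printable bytes per character of s in charset."""
    encoded = quopri.encodestring(s.encode(charset))
    return len(encoded) / len(s)

utf8_qp = quopri.encodestring(text.encode("utf-8"))    # 2 bytes per letter, each sent as "=XX"
koi8_qp = quopri.encodestring(text.encode("koi8_r"))   # 1 byte per letter, each sent as "=XX"

print(utf8_qp)
print(qp_bytes_per_char(text, "utf-8"), qp_bytes_per_char(text, "koi8_r"))
```

Every byte above 0x7F is escaped as three ASCII characters, so a single-byte legacy Cyrillic charset already triples in size, and UTF-8, with two bytes per Cyrillic letter, costs twice that again.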
Re: Is there Unicode mail out there?
In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII because not only the repertoire of US-ASCII is a subset of the repertoire of UTF-8 but also the representation of all characters in US-ASCII is identical in UTF-8. A smart mail client would notice that all characters are in US-ASCII repertoire and label outgoing messages as in US-ASCII EVEN if it's configured to label outgoing messages in UTF-8 [...] I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message and if it only contains US-ASCII characters, the message should be tagged as US-ASCII. Otherwise, if it only contains ISO 8859-1, it should be tagged as ISO 8859-1. Only if it actually contains CP1252 characters, like smart quotes or long dashes, should it be tagged as CP1252. As Jungshik observed, the same goes for UTF-8. -Doug Ewell Fullerton, California
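The labelling rule Doug describes (US-ASCII first, then ISO 8859-1, then CP1252, and only then the configured charset) can be sketched in Python roughly as follows. The function name `minimal_charset` and the fallback parameter are hypothetical, and this is only one way a mailer might implement the idea.

```python
def minimal_charset(text: str, configured: str = "utf-8") -> str:
    """Return the smallest charset label that can represent text.

    Tries US-ASCII, then ISO 8859-1, then CP1252, in that order, and
    falls back to the user's configured charset (assumed to be UTF-8).
    """
    candidates = (
        ("us-ascii", "ascii"),
        ("iso-8859-1", "latin-1"),
        ("windows-1252", "cp1252"),
    )
    for label, codec in candidates:
        try:
            text.encode(codec)      # strict mode: raises if any character is missing
            return label
        except UnicodeEncodeError:
            continue
    return configured

for sample in ("plain text", "caf\u00e9", "\u201csmart quotes\u201d", "\u0e01\u0e25\u0e31\u0e1a"):
    print(repr(sample), "->", minimal_charset(sample))
```

The order matters: each charset in the list is a superset of the one before it, so the first successful encode gives the smallest label.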
Re: Is there Unicode mail out there?
Please disregard my previous message about a work-around for Outlook Express problem. Although it works, non-UTF-8 messages are no longer being properly displayed, an unacceptable trade-off. Another possibility which was tested was to add an innocuous character which isn't included in any code page to the signature. Tried the zero-width space. When copying the zero-width space into the signature of a message being sent in reply to a message encoded as Thai (Windows), Outlook Express prompted to Send as Unicode... when the letter was tagged to be sent later. So far, so good. Figured it would be possible to set up a signature with ZWS to eliminate the necessity of manually changing the encoding of messages being sent to UTF-8 every time a message is sent. Unfortunately, on Windows M.E., the signature information is stored in the Registry, and it's ASCII. So, the ZWS got converted to a question mark and doesn't get switched back when it's added to a message. So, tried setting up a signature file to be added to each outgoing message including the ZWS. In this case, MSOE displays the UTF-8 ZWS as mojibake (gibberish) when the signature is added to the outgoing message. Perhaps a future version of Outlook will correct the problem. Best regards, James Kass.
Re: Is there Unicode mail out there?
[EMAIL PROTECTED] wrote: In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII because not only the repertoire of US-ASCII is a subset of the repertoire of UTF-8 but also the representation of all characters in US-ASCII is identical in UTF-8. A smart mail client would notice that all characters are in US-ASCII repertoire and label outgoing messages as in US-ASCII EVEN if it's configured to label outgoing messages in UTF-8 [...] I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message and if it only contains US-ASCII characters, the message should be tagged as US-ASCII. The RFCs/BCPs do encourage using as minimal a charset as possible. Anyway, UTF-8 email is nowhere right now. Kat Momoi of Netscape has suggested that about the only way this could change is if email client vendors turn it on by default in new product releases. I won't be the first! Having done a lot of email client programming using the RFCs as a basis, let me say that in general RFCs are vague, and not always the best practice for interoperability when it comes to email. For example, CRLF in message bodies is recommended, but actually reduces interoperability, particularly with subversions of IE 5. So I don't know of any email client that does it. And quoted-printable is way too complicated to expect conforming implementations. And don't get me started about all the random charsets that RFCs promote that nobody adopts! James.
Re: Is there Unicode mail out there?
Here's a work-around that seems to work. Added the ZWS after the signature in a signature file. Because the mojibake for ZWS includes the Euro currency symbol, OE prompts to 'send as Unicode' when replying to a non-UTF-8 sender. Of course, the time saved by not having to manually change the encoding will probably be less than the time lost explaining what the junk is under my name. Best regards, James Kass. ​
Re: Is there Unicode mail out there?
On Thu, 12 Jul 2001, James Kass wrote: Here's a work-around that seems to work. Added the ZWS after the signature in a signature file. Because the mojibake for ZWS includes the Euro currency symbol, OE prompts to 'send as Unicode' when replying to a non-UTF-8 sender. Mysterious is why this prompting (by MS OE) did not happen to Mike Ayers when he replied to Peter's message with Thai string in Windows-874 adding some Chinese characters while MS OE (5.50.x) I tried certainly prompted me to pick one of three (1. send as Unicode, 2. send as is - in Windows-874 - risking loss of info. 3. cancel) when I did the same thing. ZWS and Chinese characters have no reason to be treated differently when added to a Windows-874 encoded message. BTW, Mozilla/Netscape 6 also uses the encoding of the message (or its closest match among IANA-registered MIME charsets. Thus, in place of Windows-874, Mozilla/Netscape 6 uses TIS-620) you're replying to by default. When one adds some characters outside the repertoire of that encoding, it warns that there are some characters not representable in the current encoding and it's necessary to change the encoding to something that can represent all characters. (it does not suggest Unicode.) It offers two options : go ahead despite potential loss of some characters or cancel and change the encoding. Perhaps, both Mozilla/Netscape 6 and MS OE should have an option ( 'toggle-switchable') to let users specify that their preferred encoding (set in preference) be used by default regardless of the encoding of messages they're replying to. Jungshik Shin
Re: Is there Unicode mail out there?
Hmm, it didn't work either. OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
(I didn't read all the thread so maybe I missed a step). So the proposal is that minimizing the charset is a good thing? This means that you and I start out in a conversation about a product I am trying to sell you, it happens to be all in ASCII and we exchange several mails successfully. Then I quote you a price in Euros and my 1252 message gets corrupted by your reader, which can handle only 8859-1 or ASCII, and you miss the fact that the Euro is corrupted and think we are talking dollars or some other currency. Although I understand why you would want a minimal charset in order to not needlessly prevent communications, the implication of reliability and trust that is built by having some success is a problem. You think you are communicating successfully but when it is critical you may not be... Perhaps if a harder line was taken when characters are used that cannot be converted, this would make more sense. (i.e. give a very clear recognizable indication of corruption or conversion failures) tex [EMAIL PROTECTED] wrote: In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII because not only the repertoire of US-ASCII is a subset of the repertoire of UTF-8 but also the representation of all characters in US-ASCII is identical in UTF-8. A smart mail client would notice that all characters are in US-ASCII repertoire and label outgoing messages as in US-ASCII EVEN if it's configured to label outgoing messages in UTF-8 [...] I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message and if it only contains US-ASCII characters, the message should be tagged as US-ASCII. Otherwise, if it only contains ISO 8859-1, it should be tagged as ISO 8859-1. 
Only if it actually contains CP1252 characters, like smart quotes or long dashes, should it be tagged as CP1252. As Jungshik observed, the same goes for UTF-8. -Doug Ewell Fullerton, California -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
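Tex's Euro scenario, and the "harder line" he asks for, can be illustrated with a short Python sketch. The price string and the helper `encode_strict` are hypothetical; the point is only to contrast a silent misreading with a conversion that fails loudly.

```python
def encode_strict(text: str, charset: str) -> bytes:
    """Encode text, raising a clearly worded error instead of degrading silently."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError as exc:
        bad = text[exc.start:exc.end]
        raise ValueError(
            f"cannot represent {bad!r} in {charset}; choose a larger charset"
        ) from exc

price = "Price: 100 \u20ac"          # U+20AC EURO SIGN (hypothetical message text)
wire = price.encode("cp1252")        # the euro travels as the single byte 0x80
misread = wire.decode("latin-1")     # an ISO 8859-1 reader gets a C1 control, not a euro

try:
    encode_strict(price, "latin-1")
    failure = None
except ValueError as err:
    failure = str(err)

print(repr(misread))
print(failure)
```

The misread string still looks like a price with the currency symbol missing, which is exactly the quiet failure Tex is worried about; the strict variant refuses to produce anything at all.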
Re: Is there Unicode mail out there?
My other e-mail was a real "moji-baka", I'd say. That would be a good term, $BJ8;zGO(B: Re: Is there Unicode mail out there? (I didnt read all the thread so maybe I missed a step). So the proposal is that minimizing the charset is a good thing? This means that you and I start out in a conversation about a product I am trying to sell you, it happens to be all in ascii and we exchange several mails successfully. Then I quote you a price in Euros and my 1252 message gets corrupted by your reader which can handle either only 8859-1 or ASCII, and you miss the fact that the Euro is corrupted and think we are talking dollars or some other currency. Although I understand why you would want a minimal charset in order to not needlessly prevent communications, the implication of reliability and trust that is built by having some success is a problem. You think you are communicating successfully but when it is critical it may not... Perhaps if a harder line was taken when characters are used that cannot be converted, this would make more sense. (ie give a very clear recognizable indication of corruption or conversion failures) tex [EMAIL PROTECTED] wrote: In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII because not only the repertoire of US-ASCII is a subset of the repertoire of UTF-8 but also the representation of all characters in US-ASCII is identical in UTF-8. A smart mail client would notice that all characters are in US-ASCII repertoire and label outgoing messages as in US-ASCII EVEN if it's configured to label outgoing messages in UTF-8 [...] I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message and if it only contains US-ASCII characters, the message should be tagged as US-ASCII. 
Otherwise, if it only contains ISO 8859-1, it should be tagged as ISO 8859-1. Only if it actually contains CP1252 characters, like smart quotes or long dashes, should it be tagged as CP1252. As Jungshik observed, the same goes for UTF-8. -Doug Ewell Fullerton, California -- --- Tex Texin Director, International Business mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax:+1-781-280-4655 the Progress Company 14 Oak Park, Bedford, MA 01730 ---
Re: Is there Unicode mail out there?
On Thu, 12 Jul 2001 [EMAIL PROTECTED] wrote: Hmm, it didn't work either. OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว Finally, you succeeded ! Congratulations :-). Could you explain what you did differently this time so that other Lotus Notes users can benefit from your experience/experiment? Jungshik Shin
RE: Is there Unicode mail out there?
From: Jungshik Shin [mailto:[EMAIL PROTECTED]] Mysterious is why this prompting (by MS OE) did not happen to Mike Ayers when he replied to Peter's message with Thai string in Windows-874 adding some Chinese characters while MS OE (5.50.x) I tried certainly prompted me to pick one of three (1. send as Unicode, 2. send as is - in Windows-874 - risking loss of info. 3. cancel) when I did the same thing. ZWS and Chinese characters have no reason to be treated differently when added to a Windows-874 encoded message. Not mysterious really, I'm using Outlook, not Outlook Express. Despite the similarity of names, the differences seem to be considerable. It is disturbing, though, that the premium product has less desirable behavior than the free one in this case. /|/|ike
Re: Is there Unicode mail out there?
Jungshik Shin wrote: Perhaps, both Mozilla/Netscape 6 and MS OE should have an option ( 'toggle-switchable') to let users specify that their preferred encoding (set in preference) be used by default regardless of the encoding of messages they're replying to. It would be nice... MS OE appeared to already have the option. Under Tools-Options- Send, there's a check-box for Reply to messages using the format in which they were sent. Under Tools-Options-Send-International Settings, there's a provision for the user to choose a default encoding and a check-box to Use the following default encoding for outgoing messages:. Even though this system was set up accordingly, outgoing messages which were replies to messages in non-UTF-8 encodings weren't being sent in UTF-8, to my surprise, chagrin, and dismay. Best regards, James Kass. ​
RE: Is there Unicode mail out there?
In any case, no matter if new message or reply or forward, you can force OE to use a specific encoding using the Format.Encoding menu. There is no option to ALWAYS use a specific encoding in replies and forwards, you will have to choose manually each time. OE itself has no option to automatically determine the best outbound encoding (and I agree that generally the encoding with the smallest repertoire is the best). OE will only suggest UTF-8 and will not suggest any other charset, if the chosen encoding does not hold the characters used. Note: an HTML message to an HTML4 capable recipient will transport any character regardless of the chosen encoding. That might explain the different results you are seeing when sending to differently enabled recipients. Replying in the charset of the original message is in my view reasonable behavior: the recipient of your reply has the best chance to read the message in the encoding the original message was sent. Changing the encoding decreases the chance the replyee will be able to read your message. -Original Message- From: James Kass [mailto:[EMAIL PROTECTED]] Sent: Thursday, July 12, 2001 1:18 PM To: Jungshik Shin Cc: Unicode List Subject: Re: Is there Unicode mail out there? Jungshik Shin wrote: Perhaps, both Mozilla/Netscape 6 and MS OE should have an option ( 'toggle-switchable') to let users specify that their preferred encoding (set in preference) be used by default regardless of the encoding of messages they're replying to. It would be nice... MS OE appeared to already have the option. Under Tools-Options- Send, there's a check-box for Reply to messages using the format in which they were sent. Under Tools-Options-Send-International Settings, there's a provision for the user to choose a default encoding and a check-box to Use the following default encoding for outgoing messages:. 
Even though this system was set up accordingly, outgoing messages which were replies to messages in non-UTF-8 encodings weren't being sent in UTF-8, to my surprise, chagrin, and dismay. Best regards, James Kass.
RE: Is there Unicode mail out there?
From: Chris Wendt [mailto:[EMAIL PROTECTED]] Replying in the charset of the original message is in my view reasonable behavior: the recipient of your reply has the best chance to read the message in the encoding the original message was sent. Changing the encoding decreases the chance the replyee will be able to read your message. For person-to-person emails, this makes sense. It does not hold up for mailing lists, however - it's not necessarily unreasonable behavior, but the odds of readability for mailing lists are fixed to the character set, regardless of the character set used in any individual mailing (note that the Windows Thai character set could not be viewed by many people - changed to UTF-8, almost everyone could read it). For this reason, I would really like to see option controlled behavior (use the current behavior as a default). /|/|ike
Re: Is there Unicode mail out there?
Chris Wendt wrote: Replying in the charset of the original message is in my view reasonable behavior: the recipient of your reply has the best chance to read the message in the encoding the original message was sent. Changing the encoding decreases the chance the replyee will be able to read your message. When a user issues an instruction to a computer, it is a command rather than a request. If a user selects the option to Use the following default encoding for outgoing messages:, then the expected behavior is compliance. Of course, you are quite right in that the recipient is more likely to be able to read a message sent in the recipient's default. As we move towards a World encoding standard, perhaps more applications will use the standard as default. This message is being sent in Arabic (Windows) because it is in response to a message sent in that encoding. The author of the original message has noted my work-around and has cleverly prevented it by selecting a code-page which includes the special character I'm using for the kludge. Best regards, James Kass.
Re: Is there Unicode mail out there?
[EMAIL PROTECTED] had asked: Is there Unicode mail out there? On Sun, 8 Jul 2001 03:40:51 -0700 James Kass wrote: Microsoft's Outlook Express offers many e-mail encoding options, including Unicode (UTF-8) and responding to the sender in the same encoding as the sender's message. And, it won't cost you money. Same with Netscape 6.01; though it still has some teething problems. Best wishes, Otto Stolz
Re: Is there Unicode mail out there?
Can you read this? This is coming from Lotus Notes. Otto Stolz [EMAIL PROTECTED] on 07/11/2001 10:43:10 PM Please respond to Otto Stolz [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] cc: 11 [EMAIL PROTECTED] (bcc: Dutta Abhijit/India/IBM) Subject: Re: Is there Unicode mail out there? [EMAIL PROTECTED] had asked: Is there Unicode mail out there? On Sun, 8 Jul 2001 03:40:51 -0700 James Kass wrote: Microsoft's Outlook Express offers many e-mail encoding options, including Unicode (UTF-8) and responding to the sender in the same encoding as the sender's message. And, it won't cost you money. Same with Netscape 6.01; though it still has some teething problems. Best wishes, Otto Stolz
Re: Is there Unicode mail out there?
From: [EMAIL PROTECTED] Can you read this? This is coming from Lotus Notes. Yes, it looks like you are confused (all those question marks!) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: Is there Unicode mail out there?
On Wed, 11 Jul 2001 15:41:28 +0530, [EMAIL PROTECTED] wrote: Can you read this? It's 8 question marks in a row. I don't know what you had expected. Note that you have sent these header fields: Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii This announces 7-bit ASCII, cf. http://czyborra.com/charsets/iso646.html, so your message cannot contain any other character. This is coming from Lotus Notes. I haven't tried Lotus Notes, so I cannot tell you how to persuade it to send non-ASCII text (with proper headers, of course), or whether this is possible at all. Best wishes, Otto Stolz
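Otto's diagnosis can be reproduced with Python's standard `email` package: read the declared charset from the headers, then see what a forced conversion to US-ASCII does to non-ASCII text. The message below is rebuilt from the two header fields he quotes; the Thai sample is a hypothetical stand-in, since the original characters were lost before the mail left the sender.

```python
from email import message_from_string

# Reconstructed from the header fields quoted above; the body is the eight
# question marks that list members reported seeing.
raw_message = (
    "Mime-Version: 1.0\n"
    "Content-type: text/plain; charset=us-ascii\n"
    "\n"
    "????????\n"
)
msg = message_from_string(raw_message)
declared = msg.get_content_charset()     # the charset the sender announced

# Hypothetical non-ASCII text, forced into US-ASCII the lossy way.
sample = "\u0e01\u0e25\u0e31\u0e1a"
flattened = sample.encode("ascii", errors="replace")

print(declared, flattened)
```

Once the replacement has happened there is nothing left for the receiving client to decode, whatever its own Unicode support.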
Re: Is there Unicode mail out there?
Michael Kaplan wrote: From: [EMAIL PROTECTED] Can you read this? This is coming from Lotus Notes. Yes, it looks like you are confused (all those question marks!) Maybe it's the Lotus Notes that's confused rather than Dutta Abhijit? Best regards, James Kass.
Re: Is there Unicode mail out there?
In a message dated 2001-07-11 3:26:25 Pacific Daylight Time, [EMAIL PROTECTED] writes: Can you read this? This is coming from Lotus Notes. 'Nuff said. I received this with CompuServe 5.0 (similar to AOL 5.0, imagine that). I don't know if CompuServe 6.0 is any better. -Doug Ewell Fullerton, California
Re: Is there Unicode mail out there?
Can you read this? This is coming from Lotus Notes. Notes can handle Unicode characters, at least going from one Notes user to another within our system. Once it goes out on to the Internet, there may be other processes intervening that munge the data. In the Basics tab of the User Preferences dialog (I'm using R5.0.5), under Additional Options I've got Enable Unicode display enabled; under the Main and News tab, in the Multilingual Internet Mail drop down, select Use Unicode (UTF-8). Just as a test, here's a bit of Thai: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
RE: Is there Unicode mail out there?
After all the various replies that say "gosh, I can't read this", I thought it might be helpful to point out this section of Abhijit's email headers: Content-type: text/plain; charset=us-ascii The outbound mailer (even in Notes, which is a pretty well internationalized application, although they bury the settings that control this specific capability!!) can send UTF-8, as far as I remember, plus a raft of legacy encodings. In this case either the user's mail client or the mailer itself is set to send US-ASCII. Since I don't have Notes installed these days, I can't say where the controls are that change the settings (I certainly don't remember), but I do recall that I was able, as a Notes user in the past, to set my encoding. That would quite possibly make the string of eight unknown characters visible to the list. Note that this has nothing to do with which mailer you are receiving the mail with or with Sarasvati's capabilities or anything: the message was converted to nothing before it left the sender. In most cases in my recent experience, settings on the mailer or mail client itself prevent a proper Unicode message from being generated. The mailers themselves rarely care about the encoding: as long as it obeys RFCs 822/1341/1342 they are happy. Most of the more modern GUI mail clients can handle UTF-8. Yes, there are older or text-mode clients that can't deal with it, but in my experience it is getting to the point that there are (generally, generally) more problems with getting the settings set to send than with receivers receiving! Best Regards, Addison Addison P. Phillips Globalization Architect / Manager, Globalization Engineering webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA +1 408.962.5487 (phone) +1 408.210.3659 (mobile) - Internationalization is an architecture. It is not a feature. 
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED] Sent: Wednesday, July 11, 2001 3:11 AM To: Otto Stolz Cc: Unicode List Subject: Re: Is there Unicode mail out there? Can you read this? This is coming from Lotus Notes. Otto Stolz [EMAIL PROTECTED] on 07/11/2001 10:43:10 PM Please respond to Otto Stolz [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] cc: 11 [EMAIL PROTECTED] (bcc: Dutta Abhijit/India/IBM) Subject: Re: Is there Unicode mail out there? [EMAIL PROTECTED] had asked: Is there Unicode mail out there? On Sun, 8 Jul 2001 03:40:51 -0700 James Kass wrote: Microsoft's Outlook Express offers many e-mail encoding options, including Unicode (UTF-8) and responding to the sender in the same encoding as the sender's message. And, it won't cost you money. Same with Netscape 6.01; though it still has some teething problems. Best wishes, Otto Stolz
Re: Is there Unicode mail out there?
Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ Mark - Original Message - From: [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] Sent: Wednesday, July 11, 2001 09:33 Subject: Re: Is there Unicode mail out there? Can you read this? This is coming from Lotus Notes. Notes can handle Unicode characters, at least going from one Notes user to another within our system. Once it goes out on to the Internet, there may be other processes intervening that munge the data. In the Basics tab of the User Preferences dialog (I'm using R5.0.5), under Additional Options I've got Enable Unicode display enabled; under the Main and News tab, in the Multilingual Internet Mail drop down, select Use Unicode (UTF-8). Just as a test, here's a bit of Thai: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ - Peter -- - Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
RE: Is there Unicode mail out there?
From: Mark Davis [mailto:[EMAIL PROTECTED]] Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ Woohoo!!! UTF-8 party!!! ???!!! /|/|ike
Re: Is there Unicode mail out there?
On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote: In the Basics tab of the User Preferences dialog (I'm using R5.0.5), under Additional Options I've got Enable Unicode display enabled; under the Main and News tab, in the Multilingual Internet Mail drop down, select Use Unicode (UTF-8). Just as a test, here's a bit of Thai: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ Your mail has the following header, which indicates that it's in 'Windows-874' encoding. I'm not sure whether that encoding name is registered with IANA for use in MIME. X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 MIME-Version: 1.0 Content-type: text/plain; charset=Windows-874 ^^^ Anyway, to get your message properly recognized as in UTF-8 by other MIME-compliant mail programs (MS OE and Netscape 6.x/Mozilla, Pine, Mutt, etc), you have to find a way to make Lotus Notes add the correct MIME header for UTF-8 message as shown below: Content-type: text/plain; charset=UTF-8 Content-Transfer-Encoding: (8bit|base64|quoted-printable) I'm not sure if that's possible in Lotus Notes, though. MS OE and Netscape 6.x/Mozilla, Mutt and Pine work well with UTF-8 messages (for the latter two, obviously you need to have a terminal to support UTF-8) Jungshik Shin
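For comparison, here is roughly how a present-day script would produce the headers Jungshik describes, using Python's standard `email.message.EmailMessage`. This is only a sketch of what a correctly labelled UTF-8 message looks like; it says nothing about how Lotus Notes builds its headers, and the subject and body text are placeholders.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Thai test"                      # placeholder subject
# Placeholder Thai body; charset="utf-8" asks the library to label it as UTF-8.
msg.set_content("\u0e01\u0e25\u0e31\u0e1a", charset="utf-8")

print(msg["Content-Type"])                # text/plain with a charset parameter
print(msg["Content-Transfer-Encoding"])   # chosen by the library for non-ASCII text
```

The library picks the transfer encoding itself, so the second header will be one of the three alternatives listed above rather than a fixed value.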
Re: Is there Unicode mail out there?
Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ And my own message came back to me with the Thai as I originally sent it. So, I'm getting UTF-8 going out and coming in with nothing messing it up in between. If other Notes users aren't getting the same results, check the version of your client (I don't know if R4.x could handle Unicode or not), and check your preferences. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
On Wed, 11 Jul 2001, Mark Davis wrote: - Original Message - From: [EMAIL PROTECTED] Sent: Wednesday, July 11, 2001 09:33 Main and News tab, in the Multilingual Internet Mail drop down, select Use Unicode (UTF-8). Just as a test, here's a bit of Thai: กลัปมาอยู่แล้ว Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ Well, it was not in UTF-8, though. It was encoded in Windows-874 (for Thai) and was flagged as such in the Content-Type header of the message. Content-Type: text/plain; charset=Windows-874 In my previous response, I thought the actual encoding used was UTF-8 and that Lotus Notes had put an incorrect charset parameter value in the C-T header. That turned out not to be the case. At least, there's NO inconsistency between what's used in the message body and what the message header indicated was used in the message body. I'm writing this email with Pine running inside a UTF-8 enabled xterm with the following line added to the display filter spec. of my pinerc (Pine configuration file) _CHARSET(Windows-874)_ /usr/bin/iconv -f CP874 -t UTF-8 Unlike my previous message (which included a Windows-874 encoded string in Thai but was marked as UTF-8 because I thought that Thai string was in UTF-8), this message should have the Thai string encoded in UTF-8 (as indicated by the C-T header). Jungshik Shin
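The conversion that the `iconv -f CP874 -t UTF-8` display filter performs can be sketched in Python with the built-in `cp874` codec. The hex string below is an assumption on my part: it is the byte sequence that, read as Latin-1, produces the garbled test string seen throughout this thread.

```python
# Byte values behind the garbled test string (assumed: Windows-874 on the wire).
raw = bytes.fromhex("a1c5d1bbc1d2cdc2d9e8e1c5e9c7")

as_latin1 = raw.decode("latin-1")   # what a Latin-1 reader displays: accented Latin letters
thai = raw.decode("cp874")          # the same bytes read as Windows-874: Thai text
utf8 = thai.encode("utf-8")         # the equivalent of piping through iconv to UTF-8

print(as_latin1)
print(thai)
print(len(raw), len(utf8))
```

The byte count triples because every Thai character occupies one byte in Windows-874 and three in UTF-8, which is also why the two encodings cannot be mistaken for one another once the header is read correctly.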
RE: Is there Unicode mail out there?
Okay, I sent these as UTF-8, with some Chinese where the question marks are. However, the Chinese is getting eaten somewhere along the way. Oddly, though, the Thai still displays fine. Would any Outlook XP guru volunteer to help me get back to my international ways? Final test: From: Ayers, Mike [mailto:[EMAIL PROTECTED]] Let's try this again... From: Mark Davis [mailto:[EMAIL PROTECTED]] Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ Woohoo!!! UTF-8 party!!! ???!!! /|/|ike
RE: Is there Unicode mail out there?
Let's try this again... From: Mark Davis [mailto:[EMAIL PROTECTED]] Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ Woohoo!!! UTF-8 party!!! ???!!! /|/|ike
Re: Is there Unicode mail out there?
Unicode (UTF-8). Just as a test, here's a bit of Thai: ¡Ƒ»ڨ↩ō Your mail has the following header, which indicates that it's in 'Windows-874' encoding. I'm not sure whether that encoding name is registered with IANA for use in MIME. X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 MIME-Version: 1.0 Content-type: text/plain; charset=Windows-874 OK, I didn't look closely at the header, just at the result. Here's another test that will be telling - I don't know of any codepage / charset for Ethiopic: ሀሁሂሃ - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
RE: Is there Unicode mail out there?
On Wed, 11 Jul 2001, Ayers, Mike wrote: Okay, I sent these as UTF-8, with some Chinese where the question marks are. However, the Chinese is getting eaten somewhere along the way. Oddly, though, the Thai still displays fine. Would any Outlook XP guru volunteer to help me get back to my international ways? Final test: Nothing cryptic. As with others on this thread, your problem is mistaking Windows-874 (legacy encoding for Thai) for UTF-8. Because Windows-874 does NOT cover Chinese characters, they turned into '?'. Judging from your message header, you're not using MS OE but something different. X-Mailer: Internet Mail Service (5.5.2653.19) Content-Type: text/plain; charset=windows-874 Content-Transfer-Encoding: 8bit MS OE 5.x is smart enough to detect characters in your reply (in this case, Chinese characters) that are not covered by the repertoire of the MIME charset of the message you're replying to (in this case, Windows-874), which by default is also the MIME charset of your reply. It then prompts the user to choose whether to use UTF-8, explaining that some of the characters are not representable in the default encoding and will be lost. You can also configure MS OE to always use UTF-8 (or whatever encoding of your choice) regardless of the encoding of messages you're replying to. From: Ayers, Mike [mailto:[EMAIL PROTECTED]] Let's try this again... From: Mark Davis [mailto:[EMAIL PROTECTED]] Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว Woohoo!!! UTF-8 party!!! ???!!! No, it should have been Windows-874 party !! :-). Both Mark Davis and Peter Constable sent messages in Windows-874 believing that they're using UTF-8. However, I'm sending this in UTF-8 (after automatic conversion by my mail client, Pine 4.33). Jungshik Shin
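The check that MS OE is described as making before it prompts the user can be approximated in a few lines of Python. The function name `choose_reply_charset` is hypothetical, the codec name `cp874` stands in for the MIME label Windows-874, and a real client would ask the user rather than switch silently.

```python
def choose_reply_charset(reply_text: str, original_charset: str) -> str:
    """Keep the original message's charset if it can represent the whole
    reply; otherwise suggest UTF-8 (a real client would prompt here)."""
    try:
        reply_text.encode(original_charset)
        return original_charset
    except (UnicodeEncodeError, LookupError):
        # Unrepresentable characters, or a charset name Python does not know.
        return "utf-8"

thai_only = "\u0e01\u0e25\u0e31\u0e1a"          # fits in Windows-874
with_chinese = thai_only + " \u4e2d\u6587"      # Chinese is outside its repertoire

print(choose_reply_charset(thai_only, "cp874"))
print(choose_reply_charset(with_chinese, "cp874"))

# What happens if the client sends "as is" instead: the Chinese becomes '?'.
print("\u4e2d\u6587".encode("cp874", errors="replace"))
```

The last line shows the lossy path: each character missing from the target charset is replaced by a question mark, matching the "???" that appeared in place of the Chinese text.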
Re: Is there Unicode mail out there?
In a message dated 2001-07-11 13:31:54 Pacific Daylight Time, [EMAIL PROTECTED] writes: OK, I didn't look closely at the header, just at the result. Here's another test that will be telling - I don't know of any codepage / charset for Ethiopic: ሀáˆáˆ‚ሃ Everything came out fine. Of course, what I saw was the raw bytes, interpreted as CP1252, but I just cut and pasted them into SC UniPad and everything came out fine (except for the fact that UniPad doesn't have Ethiopic glyphs yet...). The header revealed the encoding Peter used: Content-type: text/plain; charset=UTF-8 -Doug Ewell Fullerton, California
RE: Is there Unicode mail out there?
I think you'll find that Peter's response applies to you too: the mailer is seeing Windows-874 on the incoming message and converting your outgoing message to use that same encoding (in a bid to be compatible with the original message). Outlook has done that for a while. If you manually set the encoding for the reply you can override that behavior. In Outlook 2000 this is Format | Encoding Best Regards, Addison -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Ayers, Mike Sent: Wednesday, July 11, 2001 12:42 PM To: Unicode List Subject: RE: Is there Unicode mail out there? Okay, I sent these as UTF-8, with some Chinese where the question marks are. However, the Chinese is getting eaten somewhere along the way. Oddly, though, the Thai still displays fine. Would any Outlook XP guru volunteer to help me get back to my international ways? Final test: From: Ayers, Mike [mailto:[EMAIL PROTECTED]] Let's try this again... From: Mark Davis [mailto:[EMAIL PROTECTED]] Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว Woohoo!!! UTF-8 party!!! ???!!! /|/|ike
Re: Is there Unicode mail out there?
Now the question is whether it's possible to force Lotus Notes to use UTF-8 as the encoding of the outgoing message EVEN WHEN characters in the message are all covered by existing encoding other than UTF-8 (e.g. Windows-874 for Thai). Well, I'm going to try one more thing -- Thai test, take 2: กลัปมาอยู่แล้ว - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: Is there Unicode mail out there?
On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote: Unicode (UTF-8). Just as a test, here's a bit of Thai: กลัปมาอยู่แล้ว Your mail has the following header, which indicates that it's in 'Windows-874' encoding. I'm not sure whether that encoding name is registered with IANA for use in MIME. X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 MIME-Version: 1.0 Content-type: text/plain; charset=Windows-874 OK, I didn't look closely at the header, just at the result. Here's another test that will be telling - I don't know of any codepage / charset for Ethiopic: ሀሁሂሃ Yes, this time you made it :-) X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 MIME-Version: 1.0 Content-type: text/plain; charset=UTF-8 Now the question is whether it's possible to force Lotus Notes to use UTF-8 as the encoding of the outgoing message EVEN WHEN the characters in the message are all covered by an existing encoding other than UTF-8 (e.g. Windows-874 for Thai). One exception to this should be US-ASCII, because not only is the repertoire of US-ASCII a subset of the repertoire of UTF-8, but the representation of every US-ASCII character is also identical in UTF-8. A smart mail client would notice that all characters are in the US-ASCII repertoire and label outgoing messages as US-ASCII EVEN if it's configured to label outgoing messages as UTF-8 (or any other superset of US-ASCII, like EUC-KR, ISO-2022-JP, GB2312-80, or ISO8859-[1-9,15]; EUC-CN is certainly a better term than GB2312-80, but it's not registered with IANA and GB2312-80 has spread too widely to remedy). There's no violation of standards in NOT doing this, but doing it would surely reduce the chance of an unnecessary 'red flag' being raised by some mail clients on the recipient's side. Unfortunately, MS OE and Netscape-Mail are not smart in this regard while Pine and Mutt are. Jungshik Shin P.S. How about making a sort of resolution to recommend that anybody writing to this list should use UTF-8 *if/when* possible? This was suggested in the past, but we're still getting a lot of messages in ISO-8859-1 and other encodings.
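Checking the declared charset, as was done by hand with the headers quoted in this message, can be sketched with Python's standard `email` package. The message text below is a hypothetical reconstruction built from the quoted header lines, with a placeholder body; only the three header lines come from the thread.

```python
from email import message_from_string

# Header lines as quoted in the thread; the body is a placeholder.
raw_message = (
    "X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000\n"
    "MIME-Version: 1.0\n"
    "Content-type: text/plain; charset=Windows-874\n"
    "\n"
    "(body omitted)\n"
)

msg = message_from_string(raw_message)

# get_content_charset() returns the charset parameter in lower case.
declared = msg.get_content_charset()
print(declared)

if declared != "utf-8":
    print("This message is not labeled as UTF-8, whatever the sender intended.")
```

The label, not the sender's configuration, is what the recipient's client acts on, which is why reading the header settles the question.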
RE: Is there Unicode mail out there?
From: Jungshik Shin [mailto:[EMAIL PROTECTED]] Nothing cryptic. As with others on this thread, your problem is to mistake Windows-874 (legacy encoding for Thai) for UTF-8. Because Windows-874 does NOT cover Chinese characters, they turned into '?'. Judging from your message header, you're not using MS OE but something different. I am using OE, set to UTF-8. If I mail Chinese to myself, all is well. X-Mailer: Internet Mail Service (5.5.2653.19) Content-Type: text/plain; charset=windows-874 Content-Transfer-Encoding: 8bit Odd. Perhaps our post office is changing things. No, it should have been Windows-874 party !! :-). Both Mark Davis and Peter Constable sent messages in Windows-874 believing that they're using UTF-8. Perhaps, like me, they sent messages in UTF-8 and had them converted to Windows-874 without consent. :-( However, I'm sending this in UTF-8 (after automatic conversion by my mail client, Pine 4.33). I also received it as UTF-8. Addison I think you'll find that Peter's response applies to you too: the mailer is seeing Windows-874 on the incoming message and converting your outgoing message to use that same encoding (in a bid to be compatible with the original message). Outlook has done that for a while. If you manually set the encoding for the reply you can override that behavior. In Outlook 2000 this is Format | Encoding /Addison Mine already says UTF-8. Test again: 你好吗? /|/|ike
Re: Is there Unicode mail out there?
In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: P.S.How about making a sort of resolution to recommend that anybody writing to this list should use UTF-8 *if /when* possible? This was suggested in the past, but we're still getting a lot of messages in ISO-8859-1 and other encodings. Believe me, I would if I could. -Doug Ewell Fullerton, California
Re: Is there Unicode mail out there?
Mike Ayers wrote: Okay, I sent these as UTF-8, with some Chinese where the question marks are. However, the Chinese is getting eaten somewhere along the way. Oddly, though, the Thai still displays fine. Would any Outlook XP guru volunteer to help me get back to my international ways? Final test: On Outlook Express 5 [Tools] - [Options] - [Read] - [Fonts] - (Unicode) - {Select appropriate fonts} - {Set as Default} - then - [Tools] - [Options] - [Read] - [International Settings] - {Check the box marked 'Use default encoding for all...'} This seems to work around the distressing practice of the program automatically replying to senders in the sender's default rather than the user's preference. Possibly there are other settings under the [Send] and/or [Compose] tabs that might also have to be adjusted. On this system, the 'reply to senders using the senders format' field was unchecked, yet my replies to earlier messages in the thread were being sent as Thai (Windows). Best regards, James Kass.
Re: Is there Unicode mail out there?
On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote: Now the question is whether it's possible to force Lotus Notes to use UTF-8 as the encoding of the outgoing message EVEN WHEN characters in the message are all covered by existing encoding other than UTF-8 (e.g. Windows-874 for Thai). Well, I'm going to try one more thing -- Thai test, take 2: กลัปมาอยู่แล้ว Hmm, it didn't work either. Even though you were replying to my message in UTF-8 (clearly labeled as such in the Content-Type header) containing some Ethiopic characters (which were removed in your reply), Lotus Notes silently (without your consent) fell back to Windows-874 when you added some Thai characters to your reply. You may have to do some more digging to find an option/switch buried deep inside that makes Lotus Notes use UTF-8 no matter what (or whenever you want), instead of using the 'smallest??' encoding that covers all characters in your outgoing messages. It seems like Lotus Notes is too 'smart'.. Jungshik Shin
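The difference between the two encodings of the Thai test string is easy to see at the byte level. The following is a small sketch, assuming Python's `cp874` and `cp1252` codecs as stand-ins for Windows-874 and for the CP1252 view some readers reported; it says nothing about how Lotus Notes chooses its encoding.

```python
thai = "กลัปมาอยู่แล้ว"

as_874 = thai.encode("cp874")    # one byte per character
as_utf8 = thai.encode("utf-8")   # three bytes per Thai character

print(len(thai), len(as_874), len(as_utf8))

# A reader that decodes the Windows-874 bytes as CP1252 sees Latin-looking mojibake.
print(as_874.decode("cp1252"))

# The byte sequences differ completely, so the charset label is what matters.
print(as_874 == as_utf8)
```

Since both forms display correctly when labeled correctly, the only reliable way to tell which one was sent is the Content-Type header.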
Re: Is there Unicode mail out there?
On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote: In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: P.S.How about making a sort of resolution to recommend that anybody writing to this list should use UTF-8 *if /when* possible? This was suggested in the past, but we're still getting a lot of messages in ISO-8859-1 and other encodings. Just in case, I didn't mean to suggest a 'resolution' to force everyone to use UTF-8. I just wanted to suggest that a gentle and friendly recommendation be made as to the encoding to use for this list. Believe me, I would if I could. Apparently, you're using CompuServe. I'm not sure if it's possible to use a mail client other than the one included in the CompuServe 'client/browser/whatever'. MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Mailer: CompuServe 2000 32-bit sub 113 If what I heard is correct, it's possible to use an external mail (IMAP4 or POP3) client like Netscape 6/Mozilla or MS OE to access mail folders in CompuServe. I also heard that, unlike AOL (although CompuServe and AOL are now affiliated), CompuServe has SMTP servers for subscribers to use for outgoing messages. If all I said is true, I'm wondering why you don't switch to one of the 'external' mail clients I mentioned to compose your messages in UTF-8. Perhaps what I heard is not the case and that's why you can't do it. There is still an option, though, namely switching your ISP :-) (perhaps that's not a viable option for some reason) Jungshik Shin
RE: Is there Unicode mail out there?
On Wed, 11 Jul 2001, Ayers, Mike wrote: One last time: From: Mark Davis [mailto:[EMAIL PROTECTED]] Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว Woohoo!!! UTF-8 party!!! 大家好!!! Congratulations ^-^ ! This time you clearly made it, with both Thai and Chinese characters intact in UTF-8. Either you manually changed the encoding to UTF-8 in the composition window (although you were replying to a message in Windows-874), or you were replying to a message encoded in UTF-8. Jungshik Shin
Re: Is there Unicode mail out there?
On Thu, 12 Jul 2001 [EMAIL PROTECTED] wrote: In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, [EMAIL PROTECTED] writes: One exception to this should be US-ASCII because not only the repertoire of US-ASCII is a subset of the repertoire of UTF-8 but also the representation of all characters in US-ASCII is identical in UTF-8. A smart mail client would notice that all characters are in US-ASCII repertoire and label outgoing messages as in US-ASCII EVEN if it's configured to label outgoing messages in UTF-8 I thought this might even be enshrined in an RFC. It certainly makes sense. If you are using a mailer that sends CP1252 down the wire (not that this is a good idea, but some mailers do this), the mailer should examine the message and if it only contains US-ASCII characters, the message should be tagged as US-ASCII. Otherwise, if it only contains ISO 8859-1, it should be tagged as ISO 8859-1. Only if it actually contains CP1252 characters, like smart quotes or long dashes, should it be tagged as CP1252. As Jungshik observed, the same goes for UTF-8. I can't say it better than you did ! While focusing on UTF-8, I forgot to mention the case involving Windows-125x, ISO-8859-x and US-ASCII. BTW, some broken/MIME-ignorant mail clients (e.g. Eudora for MS-Windows) do sorta the opposite. They mislabel outgoing messages as in ISO 8859-1 while they include characters like smart quotes and long dashes. The best would be to warn users that their messages contain those characters outside their preferred encoding and to offer a couple of options to choose from (use Unicode or other wider encodings or 'transliterate' those characters with those in the repertoire of user's preferred encoding). Short of that, at least it should label it correctly (not that I'm in favor of sending out Windows-1252 down the wire.) Jungshik Shin
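The labeling rule discussed in this message (use the narrowest charset that covers the text) can be sketched as a simple cascade. This is a minimal illustration, not any mailer's actual logic: the function name `narrowest_label` and the candidate list are hypothetical, and each MIME label is paired with the Python codec assumed to correspond to it.

```python
# (MIME label, Python codec) pairs, ordered from narrowest to widest repertoire.
CANDIDATES = [
    ("US-ASCII", "ascii"),
    ("ISO-8859-1", "iso-8859-1"),
    ("windows-1252", "cp1252"),
    ("UTF-8", "utf-8"),
]


def narrowest_label(text):
    """Return the first label in CANDIDATES whose codec can encode all of `text`."""
    for label, codec in CANDIDATES:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue
        return label
    # Only reached for text that even UTF-8 rejects, such as lone surrogates.
    raise ValueError("text cannot be encoded in any candidate charset")


print(narrowest_label("plain text"))          # ASCII only
print(narrowest_label("caf\u00e9"))           # needs Latin-1
print(narrowest_label("\u201csmart\u201d"))   # curly quotes need CP1252
print(narrowest_label("你好吗"))               # needs UTF-8
```

A mailer that mislabels in the opposite direction, as described for some clients in this message, would in effect stop at ISO-8859-1 without running the check for the third example.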
Re: Is there Unicode mail out there?
てんどうりゅうじ asked: Is there Unicode mail out there? ... Where is there a Unicode mail? I think there is at least one out there. I hope it will not cost me money; that is for sake. That is, the U+9152 sake. Microsoft's Outlook Express offers many e-mail encoding options, including Unicode (UTF-8) and responding to the sender in the same encoding as the sender's message. And, it won't cost you money. Best regards, James Kass.
Re: Is there Unicode mail out there?
James Kass wrote: てんどうりゅうじ asked: Is there Unicode mail out there? ... Where is there a Unicode mail? I think there is at least one out there. I hope it will not cost me money; that is for sake. That is, the U+9152 sake. Email has proven to be one of the protocols slowest to adopt Unicode. Everybody is still generally using legacy encodings, especially with list servers. Sending Unicode to individuals with email clients that you know can support it is ok. Microsoft's Outlook Express offers many e-mail encoding options, including Unicode (UTF-8) and responding to the sender in the same encoding as the sender's message. And, it won't cost you money. Just your soul ...