Re: Is there Unicode mail out there?

2001-07-23 Thread James Kass


Mark Davis wrote:

 The quotation I have is from my college Greek textbook (sadly my fluency has
 reduced to essentially zero after all these years).

 Perhaps some Greeks on the list could say which is the more accurate
 formulation?

 Mark
 —

 πάντων μέτρον ἄνθρωπος — Πρωταγόρας
 [http://www.macchiato.com]


In Dictionary of Foreign Phrases and Abbreviations (Guinagh, 1965),
the following appears:

 Panton metron anthropos estin.  Gk--Man is the measure of
  all things.  Quoted by Plato, Theaetetus, 178b. 

Best regards,

James Kass.





Re: Is there Unicode mail out there?

2001-07-22 Thread Martin Duerst

Sorry - By 'pattern restrictions on mixed content' I meant a
feature in XML Schema that would let a schema specify that the
mixed content of certain elements is restricted by a pattern
facet. This feature isn't in XML Schema, but it has been
discussed. It would make it possible to declare that a document
does not allow C0 control characters, which would be very
important in many cases if the basic XML syntax were to start
allowing C0.
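
As a rough illustration of what such a restriction would check, here is a minimal Python sketch. It is not XML Schema syntax, and the names NO_C0_CONTROLS and satisfies_pattern are made up for this example; the pattern simply accepts text that contains no C0 control other than tab, line feed and carriage return.

```python
import re

# Hypothetical pattern: any run of characters that are not C0 controls,
# except that TAB (0x09), LF (0x0A) and CR (0x0D) are still accepted.
NO_C0_CONTROLS = re.compile(r"[^\x00-\x08\x0B\x0C\x0E-\x1F]*")


def satisfies_pattern(text: str) -> bool:
    """Return True if the whole string matches the restriction."""
    return NO_C0_CONTROLS.fullmatch(text) is not None


print(satisfies_pattern("plain text\twith a tab\n"))  # True
print(satisfies_pattern("\x1b$B"))                    # False: ESC is a C0 control
```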

Regards,   Martin.

At 10:32 01/07/19 -0600, Shigemichi Yazawa wrote:
At Thu, 19 Jul 2001 15:52:39 +0900,
Martin Duerst [EMAIL PROTECTED] wrote:
  Of course then pattern restrictions on mixed content (which we
  currently don't have) would become really helpful.

Martin,

What kind of pattern restrictions would become necessary once C0 NCRs
are introduced? Something like this? &#x1b;$B

---
Shigemichi Yazawa
[EMAIL PROTECTED]





RE: Is there Unicode mail out there?

2001-07-20 Thread Bill Kurmey

At 01:11 PM 7/19/01 -0500, Mike Ayers wrote:

   The work has to be done somewhere.  Emerging technologies must be
compatible with existing ones, and some old technologies hang around a long
time.  Really, the disallowing of control characters makes sense, since
their interpretation in so many existing protocols can wreak havoc upon the
unsuspecting.  You simply can't send these characters around the internet
and expect them to arrive unchanged.

Does anyone have a (list, web site, reference) which lists which C0 and C1
control codes wreak havoc upon the unsuspecting and why?  



Bill Kurmey, Edmonton, AB, Canada





RE: Is there Unicode mail out there?

2001-07-20 Thread Shigemichi Yazawa

At Thu, 19 Jul 2001 13:11:35 -0500,
Ayers, Mike [EMAIL PROTECTED] wrote:
   I'm proposing it as a convention, not a proprietary solution.  I
 agree that a standard solution would be preferred, especially Martin's
 suggestion of permitting the escape codes but not the characters.  I
 proposed the markup as a workaround until a better solution could be found.

This sounds good. Can we submit a proposal to the W3C? I believe that
it would help many people.

-
Shigemichi Yazawa
[EMAIL PROTECTED]




Re: Is there Unicode mail out there?

2001-07-20 Thread John Cowan

Tex Texin scripsit:

 Which seemed to me to rule out the NCR for &gt; in situations other
 than ]]> for compatibility reasons.
 
 If they are needed elsewhere, they must be escaped using either
 numeric character references or the strings "&amp;" and "&lt;"
 respectively. The right angle bracket (>) may be represented using
 the string "&gt;", and must, for compatibility, be escaped using
 "&gt;" or a character reference when it appears in the string "]]>"
 in content, when that string is not marking the end of a CDATA
 section.

Naah.  Just because it says "may" doesn't mean anything: what may be
done, also may be not done.  You may use a numeric character
reference for any legal character.
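
A quick way to see this, sketched with Python's standard xml.etree.ElementTree parser (the element name p is arbitrary and only for illustration): the entity reference, the decimal and hexadecimal character references, and the literal character all come out as the same parsed text.

```python
import xml.etree.ElementTree as ET

# Four spellings of the right angle bracket in ordinary content.
source = "<p>a &gt; b, a &#62; b, a &#x3E; b, a > b</p>"

parsed = ET.fromstring(source).text
print(parsed)  # a > b, a > b, a > b, a > b
```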

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter




Re: Is there Unicode mail out there?

2001-07-20 Thread Tex Texin

John,
ok and thanks. I wasn't looking at the "may" though, I was looking
at the "must".

Maybe I am not parsing this sentence right. To me it says:

(must, for compatibility, be escaped using "&gt;")

or

(a character reference when it appears in the string "]]>" in
content, when that string is not marking the end of a CDATA
section.)

So it must not be an NCR, EXCEPT in the seemingly rare case where
the string ]]> appears in content AND that string is not being
used to indicate the end of a CDATA section.

How is that supposed to be read?

tex


-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---




RE: Is there Unicode mail out there?

2001-07-20 Thread Ayers, Mike


 From: Tex Texin [mailto:[EMAIL PROTECTED]] 

 So it must not be an NCR, EXCEPT in the seemingly rare case where
 the string ]]> appears in content AND that string is not being
 used to indicate the end of a CDATA section.
 
 How is that supposed to be read?

Simple.  Since ]]> is used to mark the end of a CDATA section, and
since CDATA can contain anything, if you want to put the sequence ]]>
INSIDE your CDATA, then you must escape the >, or else it will END your
CDATA.

In other words, CDATA can contain anything except literal ]]>.

Think */ and C/C++...
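
A minimal sketch of one way to cope with this, in Python. Since references such as &gt; are not expanded inside a CDATA section, a common workaround (an assumption on my part, not something stated in the thread) is to end the section between the brackets and the angle bracket and start a new one. The helper name to_cdata and the element name a are made up for the example.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting the section wherever "]]>" occurs."""
    # "]]" closes the current section and ">" becomes the first
    # character of the next one, so the terminator never appears whole.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


original = "if (x[y[0]]>1) ..."
document = "<a>" + to_cdata(original) + "</a>"

# The parser joins the adjacent sections back into one text node.
recovered = ET.fromstring(document).text
print(recovered == original)  # True
```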

HTH,


/|/|ike




Re: Is there Unicode mail out there?

2001-07-20 Thread Mark Davis

The quotation I have is from my college Greek textbook (sadly my fluency has
reduced to essentially zero after all these years).

Perhaps some Greeks on the list could say which is the more accurate
formulation?

Mark
—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: Otto Stolz [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: unicode [EMAIL PROTECTED]
Sent: Friday, July 20, 2001 09:18
Subject: Re: Is there Unicode mail out there?




Mark Davis wrote:


πάντων μέτρον ἄνθρωπος — Πρωταγόρας

You mean “πάντων χρημάτων μέτρον ἄνθρωπος”, don't
you? ;-)

Best wishes,
  Otto Stolz








Re: Is there Unicode mail out there?

2001-07-19 Thread Shigemichi Yazawa

At Thu, 19 Jul 2001 15:52:39 +0900,
Martin Duerst [EMAIL PROTECTED] wrote:
 Of course then pattern restrictions on mixed content (which we
 currently don't have) would become really helpful.

Martin,

What kind of pattern restrictions would become necessary once C0 NCRs
are introduced? Something like this? &#x1b;$B

---
Shigemichi Yazawa
[EMAIL PROTECTED]




Re: Is there Unicode mail out there?

2001-07-19 Thread Andy Heninger

I agree with the overall sentiment here, but here's one nit

 Or you are so lazy that
 you want to put it [your data] in CDATA section without checking it at
all.

CDATA sections have a severe problem, which is that there is no
way to escape otherwise legal XML characters that can't be
represented in the chosen document encoding.

The best bet is to avoid CDATA sections altogether.
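
To make the problem concrete, here is a small Python sketch that assumes a document encoded in US-ASCII. The element name q and the sample string are illustrative only. In ordinary content, a character the encoding cannot represent can fall back to a numeric character reference; inside a CDATA section no such fallback exists.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "\u03bc\u03ad\u03c4\u03c1\u03bf\u03bd < 1"  # Greek "metron", then " < 1"

# Ordinary content: escape markup characters, then let the codec turn
# anything outside US-ASCII into a numeric character reference.
as_content = b"<q>" + escape(text).encode("ascii", "xmlcharrefreplace") + b"</q>"

# CDATA: references are not recognized inside the section, so there is
# nothing to fall back on when the encoding lacks a character.
try:
    as_cdata = ("<q><![CDATA[" + text + "]]></q>").encode("ascii")
except UnicodeEncodeError:
    as_cdata = None

round_tripped = ET.fromstring(as_content).text
print(round_tripped == text)  # True
print(as_cdata)               # None
```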

Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]


- Original Message -
From: Shigemichi Yazawa [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, July 19, 2001 12:03 AM
Subject: RE: Is there Unicode mail out there?


 At Wed, 18 Jul 2001 14:21:35 -0500,
 Ayers, Mike [EMAIL PROTECTED] wrote:
  So why not use tagged data to represent C0 and C1 characters?  That
  is what XML is made of.  As far as why control characters are not
  permitted, it seems to me that this is so that XML documents can be
  passed around easily, through HTTP, email, FTP and so on, without loss
  of data.  Protocols abound which interpret control characters, so XML
  files which contain data may get mangled or may mangle the systems
  which pass them.  However, if that data is included as tagged hex
  digits, no problem will occur either way.

 XML states "Its goal is to enable generic SGML to be served, received,
 and processed on the Web in the way that is now possible with HTML."
 But, in my opinion, XML has grown far beyond its original goal.
 XML seems to be used in every aspect of software engineering
 these days.

 Tagging disallowed characters is one way to work around the
 problem. But I don't buy this solution for two reasons.

 1. Markup is for describing a document's structure. "1 Introduction"
    says "Markup encodes a description of the document's storage layout
    and logical structure." You could do something like <charEscape
    codepoint="000c" />. This doesn't express any structure of the
    document, though. Using markup merely to escape a character is
    too hacky, in my opinion.

 2. This is a proprietary solution. To get the original character, the
    application needs to know the semantics of the markup and needs to
    know how to decode the data appropriately. If it's a standard
    encoding like NCR, that's fine because everybody knows how to deal
    with it. But the tagging is specific to a DTD. It makes it difficult
    to interchange the data.

 This character restriction in XML makes XML document creation
 difficult. Say you have some data you want to wrap in XML. You don't
 know much about the content of the data. What you know about it is its
 character encoding and that it is textual data. That's fine because
 you just want to wrap it in XML. You would check if it contains &
 or < and convert them to entity references. Or you are so lazy that
 you want to put it in a CDATA section without checking it at all. The
 problem is that it might contain C0 control codes, which are legal
 characters for most of the encodings. Unless you are absolutely sure
 that the data doesn't contain any control codes, you have to check
 every character to make sure that you don't produce an ill-formed XML
 document. Even if you find a control, there isn't a standard way to
 treat it. You end up deleting it or escaping it in a proprietary way.

 -
 Shigemichi Yazawa
 [EMAIL PROTECTED]







RE: Is there Unicode mail out there?

2001-07-19 Thread Ayers, Mike


 From: John Cowan [mailto:[EMAIL PROTECTED]] 

 I think that any proposal to shrink the range of well-formed documents
 is simply a nonstarter, regrettable as that is.

I had thought that one of the main goals of XML Blueberry was
mainframe compatibility.  If so, won't they need to disallow the C1
characters which wreak havoc on mainframe terminals?  If they can make that
change, other relatively minor changes could be made at that time (if ever).
That's my thinking, anyway.

Should I be crossposting the XML folks on this?


/|/|ike




RE: Is there Unicode mail out there?

2001-07-19 Thread Ayers, Mike


 From: Shigemichi Yazawa [mailto:[EMAIL PROTECTED]] 

 XML states "Its goal is to enable generic SGML to be served, received,
 and processed on the Web in the way that is now possible with HTML."
 But, in my opinion, XML has grown far beyond its original goal.
 XML seems to be used in every aspect of software engineering
 these days.

True, but don't blame W3C for the digital hammer effect.

 Tagging disallowed characters is one way to work around the
 problem. But I don't buy this solution for two reasons.
 
 1. Markup is for describing a document's structure. "1 Introduction"
    says "Markup encodes a description of the document's storage layout
    and logical structure."

That's how it works in theory.  In practice, however, pictures,
applets, and many other non-structural components are encoded with markup.

 2. This is a proprietary solution. To get the original character, the
    application needs to know the semantics of the markup and needs to
    know how to decode the data appropriately. If it's a standard
    encoding like NCR, that's fine because everybody knows how to deal
    with it. But the tagging is specific to a DTD. It makes it difficult
    to interchange the data.

I'm proposing it as a convention, not a proprietary solution.  I
agree that a standard solution would be preferred, especially Martin's
suggestion of permitting the escape codes but not the characters.  I
proposed the markup as a workaround until a better solution could be found.

 This character restriction in XML makes XML document creation
 difficult.

The work has to be done somewhere.  Emerging technologies must be
compatible with existing ones, and some old technologies hang around a long
time.  Really, the disallowing of control characters makes sense, since
their interpretation in so many existing protocols can wreak havoc upon the
unsuspecting.  You simply can't send these characters around the internet
and expect them to arrive unchanged.


/|/|ike




Re: Is there Unicode mail out there?

2001-07-19 Thread Tex Texin

Lars,
I was looking at Section 2.4 Character Data and Markup:
http://www.w3.org/TR/2000/REC-xml-20001006#syntax

Which seemed to me to rule out the NCR for &gt; in situations other
than ]]> for compatibility reasons.

If they are needed elsewhere, they must be escaped using either
numeric character references or the strings "&amp;" and "&lt;"
respectively. The right angle bracket (>) may be represented using
the string "&gt;", and must, for compatibility, be escaped using
"&gt;" or a character reference when it appears in the string "]]>"
in content, when that string is not marking the end of a CDATA
section.

tex

Lars Marius Garshol wrote:
 
 * Tex Texin
 |
 | XML restricts the character set which by implication restricts the
 | NCR values. I see that &gt; can't use an NCR but &lt; can.
 
 They can both use NCRs. In fact, the example definitions of the
 predefined entities do just that:
 
   URL: http://www.w3.org/TR/REC-xml#sec-predefined-ent 
 
 --Lars M.

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---




Re: Is there Unicode mail out there?

2001-07-18 Thread Martin Duerst

At 14:30 01/07/17 -0700, Mark Davis wrote:
  In that case the content of the field is not text but an octet string,
  and you need to do something different, like base64-ing it.

The content in the database is not an octet string: it is a text field that
happens to have a control code -- a legitimate character code -- in it.
Practically every database allows control codes in text fields. (And why are
C1 controls allowed? After all, they are even less frequent than C0
controls.)

Mark - I understand your dissatisfaction. But the C1 controls are not
allowed in HTML4, and according to James Clark, the fact that they are
allowed in XML was an oversight.

Databases can (and should) take care of their data. There are very
few cases where having control characters in there makes sense.
In most cases, however, they are errors, and if XML gives an
incentive to fix them, all the better.

I wouldn't want any control codes in a database. Having a control-G
may be funny (the joke as I know it goes back to Don Knuth), but
something like a control-S is too much of a risk.


Regards,   Martin.




Re: Is there Unicode mail out there?

2001-07-18 Thread Mark Davis

 I wouldn't want any control codes in a database. Having a control-G
 may be funny (the joke as I know it goes back to Don Knuth), but
 something like a control-S is too much of a risk.

*You* wouldn't want?

There are a lot of characters *I* wish were not in databases, or in use at
all. A lot of them may or may not make sense. Whether or not I want them,
someone can have a database where they are allowed. By having this
(inconsistent) restriction, it simply means I can't be guaranteed full
round-tripping  from databases to XML and back, no matter what their
content.

Of course, this is not a huge restriction -- it is simply a gratuitous
annoyance. One could even live with something much more onerous, say XML
disallowing all characters whose code points were divisible by 4321 -- just
have complicated DTDs and shift into base64 if you encounter any of those
codes.

Mark
—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]








Re: Is there Unicode mail out there?

2001-07-17 Thread Lars Marius Garshol


* Michael Everson
| 
| Perhaps I have been asleep, but is that notation (&#XXXXX;) valid
| HTML for all Unicode characters?

The numeric character reference syntax is defined by SGML, and just
referenced by HTML, and in SGML it is defined in terms of the document
character set, which is defined by the SGML declaration used by each
SGML application (of which HTML is one instance).

The numeric character reference syntax can be used to refer to any
character in the document character set (as declared by the SGML
declaration used by HTML[1]). The document character set used by HTML
is Unicode, but some characters have been disallowed, and may not
appear in documents, whether directly or by reference. These are

 U+0000 - U+0009
 U+000B - U+000C
 U+000E - U+0019
 U+007F - U+009F
 U+D800 - U+DFFF 

--Lars M.

[1] URL: http://www.w3.org/TR/html401/sgml/sgmldecl.html 





Re: Is there Unicode mail out there?

2001-07-17 Thread Lars Marius Garshol


* Tex Texin
|
| XML restricts the character set which by implication restricts the
| NCR values. I see that &gt; can't use an NCR but &lt; can.

They can both use NCRs. In fact, the example definitions of the
predefined entities do just that:

  URL: http://www.w3.org/TR/REC-xml#sec-predefined-ent 

--Lars M.





Re: Is there Unicode mail out there?

2001-07-17 Thread Lars Marius Garshol


* Mark Davis
|
| The HTML spec depends on the SGML spec for a characterization of
| allowable characters. The latter, unfortunately, disallows some
| valid Unicode characters (most C0 controls), but inconsistently
| allows other similar characters (C1 controls). 

SGML is silent on the issue of what characters are allowed. It is the
SGML declaration used by each application which decides this, and you
can easily make an SGML declaration which allows every Unicode
character.

To wit:

<!SGML  "ISO 8879:1986 (WWW)"
     CHARSET
           BASESET  "ISO Registration Number 177//CHARSET
                     ISO/IEC 10646-1:1993 UCS-4 with
                     implementation level 3//ESC 2/5 2/15 4/6"
          DESCSET 0       55296   0
                  55296   2048    UNUSED  -- SURROGATES --
                  57344   1056768 57344

CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
                ENTCAP          150000

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR NONE
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO

         DELIM    GENERAL  SGMLREF
                  HCRO     "&#38;#x" -- 38 is the number for ampersand --
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   60      -- increased --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   64

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  YES
    RANK     NO
    SHORTTAG YES
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO NONE
>


| That means that it is not possible in HTML (or more importantly, in
| XML) to represent all valid Unicode characters in data fields.

What would you want to use control characters for in an XML document?

--Lars M.





Re: Is there Unicode mail out there?

2001-07-17 Thread Mark Davis

I had been told by the W3C people that the reason for forbidding control
characters in XML and HTML was for compatibility with SGML. I've never
checked it, since unfortunately the SGML standard is not online. If not
true, that's very interesting.

When you are thinking of XML as a general transmission mechanism for data
(not just a text document) it becomes clear. Suppose that you have a
database, of any sort. Some fields may or may not contain control
characters -- since control characters are perfectly legal in many if not
all databases. You want to query that database and get a selection, packaged
as XML.

Unfortunately, you have to invent your own home-brew quoting mechanism for
the control characters, since the standard XML does not permit you to
represent all of the -- perfectly valid -- characters in that database. And
such a home-brew mechanism will not interwork with anything else.

Conversely, you could filter out the control characters. That, of course,
would corrupt the data. Generally considered a bad thing.
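
For illustration, here is a minimal Python sketch of such a home-brew mechanism. It borrows the charEscape element with a codepoint attribute from the example given elsewhere in this thread; encode_field, decode_field and the notes element are names invented for the sketch. It round-trips the data, but only for a consumer that knows this private convention, which is exactly the interworking problem.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(cp: int) -> bool:
    """Production [2] Char of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def encode_field(tag: str, value: str) -> ET.Element:
    """Build an element; each forbidden character becomes <charEscape/>."""
    elem = ET.Element(tag)
    elem.text = ""
    last = None  # most recent charEscape child; its .tail receives text
    for ch in value:
        if is_xml10_char(ord(ch)):
            if last is None:
                elem.text += ch
            else:
                last.tail = (last.tail or "") + ch
        else:
            last = ET.SubElement(elem, "charEscape",
                                 codepoint="%04x" % ord(ch))
    return elem


def decode_field(elem: ET.Element) -> str:
    """Inverse of encode_field; only works if you know the convention."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "charEscape":
            parts.append(chr(int(child.get("codepoint"), 16)))
        parts.append(child.tail or "")
    return "".join(parts)


value = "page one\x0cpage two\x07"
serialized = ET.tostring(encode_field("notes", value))
print(decode_field(ET.fromstring(serialized)) == value)  # True
```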

Mark

—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]









Re: Is there Unicode mail out there?

2001-07-17 Thread Mark Davis

 In that case the content of the field is not text but an octet string,
 and you need to do something different, like base64-ing it.

The content in the database is not an octet string: it is a text field that
happens to have a control code -- a legitimate character code -- in it.
Practically every database allows control codes in text fields. (And why are
C1 controls allowed? After all, they are even less frequent than C0
controls.)

Your task is to design an XML DTD to represent a selection from a database.
The database is nothing fancy: Latin-1 encoded. It is conceivable that a
control character is in one of the hundreds of thousands of records. Not
likely, but conceivable. You must guarantee no loss of data in the XML
representation of the data.

If XML could represent all control characters, then an instance of a
selection in XML might be as simple as the following.

<record>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
  <birthdate>1950-10-10</birthdate>
...
</record>

The DTD would also be simple. Now, change the DTD (*and* the program that
interprets it) so that each and every text field could be a base64 instead.
Very ugly. You don't want to simply change all the fields to base64, since
that would (a) bulk them up and (b) make them unreadable for debugging. So
you end up having each field have two alternate representations. And in your
parser you have to be prepared for either, and in your generator you have to
pick between them.

Notice that for *any* database that allows control codes, to avoid data
corruption you would have to do such ugliness for any XML representation. Of
course, nobody does it, which means that there is always the opportunity for
data corruption. Of course, one might just not care -- after all, it would
be rare that this would cause a problem.
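
A minimal Python sketch of the two-representation approach described above. The enc="base64" attribute and the function names are hypothetical, chosen only for this example, and the bytes are UTF-8 here although the scenario's database is Latin-1. The generator has to pick a form per field and the reader has to be prepared for either.

```python
import base64
import xml.etree.ElementTree as ET


def is_xml10_char(cp: int) -> bool:
    """Production [2] Char of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def encode_field(tag: str, value: str) -> ET.Element:
    """Emit plain text when possible, base64 only when forced to."""
    elem = ET.Element(tag)
    if all(is_xml10_char(ord(ch)) for ch in value):
        elem.text = value  # readable form, the common case
    else:
        elem.set("enc", "base64")  # hypothetical marker attribute
        elem.text = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return elem


def decode_field(elem: ET.Element) -> str:
    """Accept either representation."""
    text = elem.text or ""
    if elem.get("enc") == "base64":
        return base64.b64decode(text).decode("utf-8")
    return text


record = ET.Element("record")
record.append(encode_field("firstname", "John"))
record.append(encode_field("lastname", "Smi\x07th"))  # contains control-G

reparsed = ET.fromstring(ET.tostring(record))
fields = [decode_field(child) for child in reparsed]
print(fields == ["John", "Smi\x07th"])  # True
```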

Mark

—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: John Cowan [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; Lars Marius Garshol [EMAIL PROTECTED];
Martin Duerst [EMAIL PROTECTED]
Sent: Tuesday, July 17, 2001 11:10
Subject: Re: Is there Unicode mail out there?


 Mark Davis wrote:

  I had been told by the W3C people that the reason for forbidding control
  characters in XML and HTML was for compatibility with SGML.


 More accurately, with the SGML default syntax, which is used in HTML
 and (with a few modifications) in XML.


  When you are thinking of XML as a general transmission mechanism for
  data (not just a text document) it becomes clear. Suppose that you
  have a database, of any sort. Some fields may or may not contain
  control characters -- since control characters are perfectly legal in
  many if not all databases. You want to query that database and get a
  selection, packaged as XML.


 In that case the content of the field is not text but an octet string,
 and you need to do something different, like base64-ing it.

 --
 There is / one art || John Cowan [EMAIL PROTECTED]
 no more / no less  || http://www.reutershealth.com
 to do / all things || http://www.ccil.org/~cowan
 with art- / lessness   \\ -- Piet Hein







Re: Is there Unicode mail out there?

2001-07-17 Thread DougEwell2

In a message dated 2001-07-17 2:24:44 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  The document character set used by HTML
  is Unicode, but some characters have been disallowed, and may not
  appear in documents, whether directly or by reference. These are

   U+0000 - U+0009
   U+000B - U+000C
   U+000E - U+0019
   U+007F - U+009F
   U+D800 - U+DFFF 

This list, and others like it, needs to be updated to include the
non-characters (0xFDD0 through 0xFDEF, plus all code points whose low-order
16 bits are 0xFFFE or 0xFFFF).

I was just looking through the XML spec today, and the only non-characters
excluded (other than the surrogates) are 0xFFFE and 0xFFFF.
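
The two definitions can be compared with a short Python sketch. The function names are mine; the ranges are the non-character ranges described above and production [2] Char from the XML 1.0 spec.

```python
def is_noncharacter(cp: int) -> bool:
    """0xFDD0..0xFDEF, plus code points whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_xml10_char(cp: int) -> bool:
    """Production [2] Char of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
still_allowed = [cp for cp in noncharacters if is_xml10_char(cp)]

print(len(noncharacters))  # 66
print(len(still_allowed))  # 64: every non-character except FFFE and FFFF
```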

-Doug Ewell
 Fullerton, California




Re: Is there Unicode mail out there?

2001-07-17 Thread John Cowan

[EMAIL PROTECTED] scripsit:

 I was just looking through the XML spec today, and the only non-characters 
 excluded (other than the surrogates) are 0xFFFE and 0xFFFF.

Unfortunately, there's nothing we can do about it now, nor about the useless
C1 controls other than NEL.  Shrinking the range of well-formed documents
is an immediate loser, even if there is no plausible use for such
documents.

Just pretend you'll never get one of the legal non-characters.

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter




Re: Is there Unicode mail out there?

2001-07-16 Thread Shigemichi Yazawa

At Sat, 14 Jul 2001 09:49:30 -0700,
Mark Davis [EMAIL PROTECTED] wrote:
 
 No, but it is for the vast majority.
 
 Some have to be written specially, e.g. &lt;

I looked at XML 1.0 spec and it says in 2.4 Character Data and Markup
that

If they are needed elsewhere, they must be escaped using either
numeric character references or the strings "&amp;" and "&lt;"
respectively.

I also looked at HTML 4.01 spec and it doesn't say in 5.3.2 Character
entity references that &#60; cannot be used to represent <.

 Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

This is true for XML, but I couldn't find any statement in HTML 4.01
spec to restrict the use of U+0007 in HTML document.

By the way, I have been pondering why, in XML, all the C1 control
characters are legal but some of the C0 control characters are
not. 2.2 Characters says that "Legal characters are tab, carriage
return, line feed, and the legal characters of Unicode and ISO/IEC
10646." and the BNF for Char is this.

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
             /* any Unicode character, excluding the surrogate
                blocks, FFFE, and FFFF. */

Does this mean C0 controls are not legal Unicode characters?
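
Read literally, the production can be transcribed as a small Python check. This is only a sketch of the BNF above, with function names of my own choosing; it shows U+0087 passing and U+0007 failing.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal(text: str):
    """Return (index, code point) of the first non-Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index, ord(ch)
    return None


print(is_xml10_char(0x87))      # True: a C1 control is accepted
print(is_xml10_char(0x07))      # False: control-G is not
print(first_illegal("ab\x07c")) # (2, 7)
```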

---
Shigemichi Yazawa
[EMAIL PROTECTED]




Re: Is there Unicode mail out there?

2001-07-16 Thread Mark Davis

The HTML spec depends on the SGML spec for a characterization of allowable
characters. The latter, unfortunately, disallows some valid Unicode
characters (most C0 controls), but inconsistently allows other similar
characters (C1 controls). That means that it is not possible in HTML (or
more importantly, in XML) to represent all valid Unicode characters in data
fields.

Mark

—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: Shigemichi Yazawa [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Monday, July 16, 2001 12:12
Subject: Re: Is there Unicode mail out there?


 At Sat, 14 Jul 2001 09:49:30 -0700,
 Mark Davis [EMAIL PROTECTED] wrote:
 
  No, but it is for the vast majority.
 
  Some have to be written specially, e.g. lt;

 I looked at XML 1.0 spec and it says in 2.4 Character Data and Markup
 that

 If they are needed elsewhere, they must be escaped using either
 numeric character references or the strings amp; and lt;
 respectively.

 I also looked at HTML 4.01 spec and it doesn't say in 5.3.2 Character
 entity references that #60; cannot be used to represent .

  Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

 This is true for XML, but I couldn't find any statement in HTML 4.01
 spec to restrict the use of U+0007 in HTML document.

 By the way, I have been pondering why, in XML, all the C1 control
 characters are legal but some of the C0 control characters are
 not. 2.2 Characters says that "Legal characters are tab, carriage
 return, line feed, and the legal characters of Unicode and ISO/IEC
 10646." and the BNF for Char is this.

 [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
              [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     /* any Unicode character, excluding the surrogate blocks,
        FFFE, and FFFF. */

 Does this mean C0 controls are not legal Unicode characters?

 ---
 Shigemichi Yazawa
 [EMAIL PROTECTED]







RE: Is there Unicode mail out there?

2001-07-15 Thread Christopher J Fynn

Gaute B Strokkenes wrote:

 ...
That's the only benefit that Unicode and UTF-8 will bring to email:
the ability to mix and match characters from all scripts of all sizes
and shapes in a single message.  OTOH, for those of us who need this
it's a big advantage.


There are also a number of scripts which don't have any registered 
encoding or code-page except Unicode / ISO-10646 - for users of those
scripts, whether or not they want to mix characters from other 
scripts, Unicode / UTF-8 is the only real choice (unless they want to 
use some non-standard font based encoding).

However, since many of these scripts are also complex scripts, 
clients need to be able to render them properly to be of much use
with these scripts. 

- Chris




RE: Is there Unicode mail out there?

2001-07-15 Thread Christopher J Fynn


Mark Davis wrote:

 
Take a look at the XML standard.

Mark


The thread was discussing HTML. Are there any restrictions on numeric character 
references in the *HTML* standard?

- Chris


 




Re: Is there Unicode mail out there?

2001-07-15 Thread Tex Texin

Mark,
ok thanks. XML restricts the character set which by implication
restricts the NCR values. I see that &gt; can't use an NCR but &lt;
can.

tex

Mark Davis wrote:
 
 Take a look at the XML standard.
 
 Mark
 - Original Message -
 From: Tex Texin [EMAIL PROTECTED]
  Hi. I am not sure why you say this. &lt; is often used for <
  but &#X003C; works in both IE 5 and Netscape 4.7.
 
  &#X0007; shows a box though...
 
  But I was not aware of any restrictions on numeric character
  references. Is there a list of restrictions somewhere?
  tex

  Mark Davis wrote:
   No, but it is for the vast majority.
   Some have to be written specially, e.g. &lt;
   Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---




Re: Is there Unicode mail out there?

2001-07-15 Thread Mark Davis

yes
- Original Message -
From: Christopher J Fynn [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: Mark Davis [EMAIL PROTECTED]
Sent: Saturday, July 14, 2001 22:57
Subject: RE: Is there Unicode mail out there?



 Mark Davis wrote:

 
 Take a look at the XML standard.

 Mark
 

 The thread was discussing HTML. Are there any restrictions on numeric
character references in the *HTML* standard?

 - Chris










Re: Is there Unicode mail out there?

2001-07-14 Thread David Starner

From: Gaute B Strokkenes [EMAIL PROTECTED]
 No way.  Any mail client that is sufficiently clever to understand
 UTF-8 should understand all valid and registered MIME-charsets.  After
 all, conversion libraries are both widely available and easy to use.

Do you know of any that actually do? How about just supporting these:
ISO646-PT, ISO10646-UTF-1, NATS-SEFI and HP-DeskTop?

 All the `all messages should be in UTF-8, even when there are
 well-established legacy encodings that cover the characters of a given
 message' mumbo-jumbo that has been mentioned recently on the list is
 really just so much hot air.

I don't think anyone was suggesting that for all lists. However, here, on
the Unicode list, everyone on the list should be able to handle Unicode, and
those who can have sometimes been willing to cut and paste into a Unicode
editor just to see what's up. Legacy encodings should be used when you're
communicating with people who use legacy encodings and legacy mail readers.
Unicode people don't - after ASCII, UTF-8 is probably the closest thing we
have to a common usable encoding.

--
David Starner - [EMAIL PROTECTED]





Re: Is there Unicode mail out there?

2001-07-14 Thread Gaute B Strokkenes

On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote:
 From: Gaute B Strokkenes [EMAIL PROTECTED]
 No way.  Any mail client that is sufficiently clever to understand
 UTF-8 should understand all valid and registered MIME-charsets.
 After all, conversion libraries are both widely available and easy
 to use.
 
 Do you know of any that actually do?

Actually do convert messages in arbitrary charsets to UTF-8 / Unicode,
you mean?  Any reasonably modern mail client will.  IIRC Microsoft OE
and friends do everything in Unicode internally and only convert to
other encodings when receiving or sending mail.  (Though OE is broken
in so many other ways that I wouldn't recommend it.)  Gnus/Emacs does
too (actually it uses the Emacs MULE encoding internally, but from the
user's perspective the effect is precisely the same).

My argument is based on the fact that if you have put in the necessary
work to interpret UTF-8 messages, then it does not take at all that
much extra effort to interpret messages in other charsets by running
them through a converter first.  I postulate that libraries to perform
this function are both widely available and highly portable; if you do
not agree then I would be happy to point out concrete examples.
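
As a rough illustration of the "run it through a converter first" argument, here is a minimal Python sketch that decodes a message body according to its charset label. The helper name to_unicode and the label x-no-such-charset are invented for the example; the set of labels that actually works depends entirely on which codecs the installed library ships, which is the point of the objection about obscure registered charsets.

```python
from typing import Optional


def to_unicode(body: bytes, charset: str) -> Optional[str]:
    """Decode a message body using its declared charset label.

    Returns None when this Python installation has no codec for the label.
    """
    try:
        # errors="replace" turns undecodable bytes into U+FFFD instead of failing.
        return body.decode(charset, errors="replace")
    except LookupError:
        return None


# Sample bodies are built by encoding, so the bytes are known to be valid.
samples = [
    ("Grüße".encode("iso-8859-1"), "iso-8859-1"),
    ("Привет".encode("koi8-r"), "koi8-r"),
    ("こんにちは".encode("shift_jis"), "shift_jis"),
    (b"hello", "x-no-such-charset"),  # made-up label to show the failure path
]

for raw, label in samples:
    print(label, "->", to_unicode(raw, label))
```

The conversion itself is a single call per message. The real work is in having a codec table that covers the labels seen in practice.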

 How about just supporting these: ISO646-PT, ISO10646-UTF-1,
 NATS-SEFI and HP-DeskTop?

I'm not sure what you're trying to say here.  Assuming these are
properly registered charsets, it seems like a very narrow range to
support.  If they're not, then they have no place in email whatsoever
(and UTF-8 is clearly a better choice.)

 I don't think anyone was suggesting that for all lists. However,
 here, on the Unicode list, everyone on the list should be able to
 handle Unicode, and those who can have sometimes been willing to cut
 and paste into a Unicode editor just to see what's up.

I don't think that holds.  People on the unicode list are not
necessarily Unicode boffins, although a lot of the active people are.
Some of us are just here because we have an interest in, say, i18n in
general and like to keep an eye on things.  If we all had to upgrade
our software to do so, I think a lot of people just wouldn't bother.
That way, everyone loses.

Note that I think it is appropriate to use UTF-8 when there's just no
common charset that can represent a given message.

 Legacy encodings should be used when you're communicating with
 people who use legacy encodings and legacy mail readers.  Unicode
 people don't - after ASCII, UTF-8 is probably the closest thing we
 have to a common usable encoding.

It's the closest thing that we have to a common _universal_ charset.
For messages that do not require the `universal' property, there are
many charsets that are just as sensible and, more to the point, much
better supported.

-- 
Gaute Strokkenes        http://www.srcf.ucam.org/~gs234/
Yow!  Am I in Milwaukee?




Re: Is there Unicode mail out there?

2001-07-14 Thread Michael Everson

At 11:07 -0400 2001-07-13, Tex Texin wrote:

 Maybe writing the value as an HTML numeric character reference (e.g.
 &#X20AC;) would also make it easier for processes reading files
 saved by the mailer to recover the character.

Perhaps I have been asleep, but is that notation (&#X;) valid
HTML for all Unicode characters?
-- 
Michael Everson




Re: Is there Unicode mail out there?

2001-07-14 Thread Daniel Biddle

On Sat, Jul 14, 2001 at 01:10:15PM +0100, Michael Everson wrote:
 At 11:07 -0400 2001-07-13, Tex Texin wrote:
 
 Maybe writing the value as an HTML numeric character reference (e.g.
 &#X20AC;) would also make it easier for processes reading files
 saved by the mailer to recover the character.
 
 Perhaps I have been asleep, but is that notation (&#X;) valid
 HTML for all Unicode characters?

Since HTML 4, yes: http://www.w3.org/TR/html4/charset.html#h-5.3.1

-- 
Daniel Biddle [EMAIL PROTECTED]




Re: Is there Unicode mail out there?

2001-07-14 Thread Mark Davis

No, but it is for the vast majority.

Some have to be written specially, e.g. &lt;

Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

Mark
- Original Message - 
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 14, 2001 05:10
Subject: Re: Is there Unicode mail out there?


 At 11:07 -0400 2001-07-13, Tex Texin wrote:
 
 Maybe writing the value as an HTML numeric character reference (e.g.
 &#X20AC;) would also make it easier for processes reading files
 saved by the mailer to recover the character.
 
 Perhaps I have been asleep, but is that notation (&#X;) valid
 HTML for all Unicode characters?
 -- 
 Michael Everson
 
 





Re: Is there Unicode mail out there?

2001-07-14 Thread Michael Everson

At 09:49 -0700 2001-07-14, Mark Davis wrote:

   Maybe writing the value as an HTML numeric character reference (e.g.
   &#X20AC;) would also make it easier for processes reading files
   saved by the mailer to recover the character.
  
   Perhaps I have been asleep, but is that notation (&#X;) valid
   HTML for all Unicode characters?

No, but it is for the vast majority.

Some have to be written specially, e.g. &lt;

Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

Then it's not standard and can't be relied upon. Pity.
-- 
Michael Everson




Re: Is there Unicode mail out there?

2001-07-14 Thread Michael \(michka\) Kaplan


michka

the only book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 14, 2001 9:56 AM
Subject: Re: Is there Unicode mail out there?


 At 09:49 -0700 2001-07-14, Mark Davis wrote:

Maybe writing the value as an HTML numeric character reference (e.g.
&#X20AC;) would also make it easier for processes reading files
saved by the mailer to recover the character.
   
Perhaps I have been asleep, but is that notation (&#X;) valid
HTML for all Unicode characters?
 
 No, but it is for the vast majority.
 
 Some have to be written specially, e.g. &lt;
 
 Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

 Then it's not standard and can't be relied upon. Pity.
 --
 Michael Everson







Re: Is there Unicode mail out there?

2001-07-14 Thread Michael \(michka\) Kaplan

From: Michael Everson [EMAIL PROTECTED]

 Then it's not standard and can't be relied upon. Pity.

Actually, it is a standard, as of HTML 4.0. All you need is a compliant
browser.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: Is there Unicode mail out there?

2001-07-14 Thread G. Adam Stanislav

At 12:03 2001-07-13 EDT, [EMAIL PROTECTED] wrote:
Unfortunately, the Windows world has no concept of a Last Resort font.  It 
would certainly seem to be a useful solution in cases like this.

Does a PostScript, Type 1, version of such a font exist for
download somewhere?

Adam
--- 
http://phonecowboy.com/registrar/twist/ finds a good domain for you
and checks for its existence.




Re: Is there Unicode mail out there?

2001-07-14 Thread David Starner

From: Gaute B Strokkenes [EMAIL PROTECTED]
 On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote:
  From: Gaute B Strokkenes [EMAIL PROTECTED]
  No way.  Any mail client that is sufficiently clever to understand
  UTF-8 should understand all valid and registered MIME-charsets.
  After all, conversion libraries are both widely available and easy
  to use.
 
  Do you know of any that actually do?

 Actually do convert messages in arbitrary charsets to UTF-8 / Unicode,
 you mean?

No, I mean understand all valid and registered MIME-charsets.

  How about just supporting these: ISO646-PT, ISO10646-UTF-1,
  NATS-SEFI and HP-DeskTop?

 I'm not sure what you're trying to say here.  Assuming these are
 properly registered charsets, it seems like a very narrow range to
 support.

Maybe supporting at least these would have been a better phrasing. They're
all valid and registered MIME-charsets. Do you know of a single mailer that
supports all 4?

 If we all had to upgrade
 our software to do so, I think a lot of people just wouldn't bother.

You're claiming on one hand that everyone's mailer should handle all sorts
of charsets, and on the other using one that doesn't support the only
charset that is RFC-mandated for a working mail program to support. (Yes, a
mailer that doesn't handle UTF-8 violates the appropriate RFCs.)

 It's the closest thing that we have to a common _universal_ charset.

You sure? Besides ASCII, what other charset can almost everyone read
(including the people who cut and paste into Unicode editors, because they
can read it)? There's no other charset (besides ASCII) that everyone with a
working mailer, no matter how minimal, can read.

--
David Starner - [EMAIL PROTECTED]





Re: Is there Unicode mail out there?

2001-07-14 Thread Gaute B Strokkenes

On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote:
  How about just supporting these: ISO646-PT, ISO10646-UTF-1,
  NATS-SEFI and HP-DeskTop?

 I'm not sure what you're trying to say here.  Assuming these are
 properly registered charsets, it seems like a very narrow range to
 support.
 
 Maybe supporting at least these would have been a better
 phrasing. They're all valid and registered MIME-charsets. Do you
 know of a single mailer that supports all 4?

OK, I get your point.  There are a lot of obscure charsets out there,
and it's probably not necessary to make sure that mail clients
understand all of them since a lot of these have no precedent for use
in email.  Nevertheless, there are a number of charsets--ISO-8859-1,
ISO-8859-2, KOI8-R, Shift_JIS and so on--that have widespread
precedent for use in email, and are de-facto standards for email in
certain languages.  It would be extremely foolish to implement a mail
client that understands UTF-8 but not these.

 If we all had to upgrade our software to do so, I think a lot of
 people just wouldn't bother.
 
 You're claiming on one hand that everyone's mailer should handle all
 sorts of charsets, and on the other using one that doesn't support
 the only charset that is RFC-mandated for a working mail program to
 support.

I'm sorry, but you're mixing things up a bit.  Keep in mind that in
general there is a difference between what processes implementing
Internet protocols should generate and what they are required to
accept.  One of the principles that the Internet is founded on is to
be liberal in what you accept, and conservative in what you produce.

 (Yes, a mailer that doesn't handle UTF-8 violates the appropriate
 RFCs.)

Chapter and verse, please?  The only document I could find that puts
forth such a requirement is the one at:

  http://www.imc.org/mail-i18n.html

which is not a RFC.  Other than that, there is RFC 2277; however this
only states that protocols must make it possible to exchange textual
data using UTF-8; it doesn't make it mandatory to understand UTF-8.

RFC 2049 only states that US-ASCII must be understood, and the same
for the ISO-8859-X charsets, except that you're not required to be
able to display the non-ASCII characters they contain.  There's no
mention of UTF-8.

If you have any better references, please provide them.  (I do not
claim to have encyclopedic knowledge of the subject.)

Note that the IMC document does not encourage mail clients to produce
UTF-8 by default; it only states that mail clients should be able to
interpret it and give users the option to create messages in UTF-8.
It explicitly recognises that few mail clients implemented good
UTF-8 support at the time.  That was three years ago, and little has
changed since.  It is only very recently that good UTF-8 support has
become standard for new clients, and there are still lots and lots of
old clients that have no UTF-8 support at all.  It is certainly clear
that the time scale hinted at in the document (that all mail clients
created or revised after 1 January 1999 should be able to interpret
UTF-8) was hopelessly optimistic.  We're not there yet, even though
we're getting closer.

 It's the closest thing that we have to a common _universal_
 charset.
 
 You sure? Besides ASCII, what other charset can almost everyone read
 (including the people who cut and paste into Unicode editors,
 because they can read it)? There's no other charset (besides ASCII)
 that everyone with a working mailer, no matter how minimal, can
 read.

Well, I'm saying that UTF-8 / Unicode is the closest thing that we
have to a universal charset.  (I meant universal as in universal
character repertoire, not universally supported.)  There are many
charsets that are better supported in general than UTF-8; ASCII and
ISO-8859-1 are two of them.

However, the problem in question is not to choose the best charset
in general, but to choose the best possible charset for a given
message containing a given set of characters.  RFC 2046 states:

   More generally, if a widely-used character set is a subset of
   another character set, and a body contains only characters in the
   widely-used subset, it should be labelled as being in that subset.
   This will increase the chances that the recipient will be able to
   view the resulting entity correctly.

I think this is good advice.  Consider the scenario where a group of
people are accustomed to exchanging email in the language of their
choice in a particular charset with little difficulty.  Then some
members of the group upgrade their software, and the other members of
the group can then no longer read their messages, since the new
software insists on using UTF-8 (which the older software does not
support).  That's bad, and the above advice avoids this situation.
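
The RFC 2046 advice quoted above can be sketched in a few lines of Python: try the most widely supported charset first and fall back to UTF-8 only when the text needs it. The function name pick_charset and the particular preference order are assumptions made for this example, not anything taken from the RFC.

```python
# Candidate labels, from most widely supported to most capable.
PREFERENCE = ("us-ascii", "iso-8859-1", "utf-8")


def pick_charset(text: str, candidates=PREFERENCE) -> str:
    """Return the first candidate charset that can encode every character."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    raise ValueError("no candidate charset can represent this text")


for sample in ("plain text", "café au lait", "price: \u20ac100"):
    print(repr(sample), "->", pick_charset(sample))
```

A mailer for a community that normally uses KOI8-R or Shift_JIS would put that charset ahead of UTF-8 in the list, so that upgrading the software does not change the label on messages that older clients could previously read.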

-- 
Gaute Strokkenes        http://www.srcf.ucam.org/~gs234/
I'm thinking about DIGITAL READ-OUT systems and
 computer-generated IMAGE FORMATIONS..




Re: Is there Unicode mail out there?

2001-07-14 Thread Tex Texin

Mark, 
Hi. I am not sure why you say this. &lt; is often used for <
but &#X003C; works in both IE 5 and Netscape 4.7.

&#X0007; shows a box though...

But I was not aware of any restrictions on numeric character
references. Is there a list of restrictions somewhere?
tex


Mark Davis wrote:
 
 No, but it is for the vast majority.
 
 Some have to be written specially, e.g. &lt;
 
 Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)
 
 Mark
 - Original Message -
 From: Michael Everson [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Saturday, July 14, 2001 05:10
 Subject: Re: Is there Unicode mail out there?
 
  At 11:07 -0400 2001-07-13, Tex Texin wrote:
 
  Maybe writing the value as an HTML numeric character reference (e.g.
  &#X20AC;) would also make it easier for processes reading files
  saved by the mailer to recover the character.
 
  Perhaps I have been asleep, but is that notation (&#X;) valid
  HTML for all Unicode characters?
  --
  Michael Everson

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---




Re: Is there Unicode mail out there?

2001-07-14 Thread Mark Davis

Take a look at the XML standard.

Mark
- Original Message - 
From: Tex Texin [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; Michael Everson [EMAIL PROTECTED]
Sent: Saturday, July 14, 2001 21:15
Subject: Re: Is there Unicode mail out there?


 Mark, 
 Hi. I am not sure why you say this. &lt; is often used for <
 but &#X003C; works in both IE 5 and Netscape 4.7.
 
 &#X0007; shows a box though...
 
 But I was not aware of any restrictions on numeric character
 references. Is there a list of restrictions somewhere?
 tex
 
 
 Mark Davis wrote:
  
  No, but it is for the vast majority.
  
  Some have to be written specially, e.g. &lt;
  
  Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)
  
  Mark
  - Original Message -
  From: Michael Everson [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Sent: Saturday, July 14, 2001 05:10
  Subject: Re: Is there Unicode mail out there?
  
   At 11:07 -0400 2001-07-13, Tex Texin wrote:
  
   Maybe writing the value as an HTML numeric character reference (e.g.
   &#X20AC;) would also make it easier for processes reading files
   saved by the mailer to recover the character.
  
   Perhaps I have been asleep, but is that notation (&#X;) valid
   HTML for all Unicode characters?
   --
   Michael Everson
 
 -- 
 ---
 Tex Texin  Director, International Business
 mailto:[EMAIL PROTECTED]  +1-781-280-4271
 Fax:+1-781-280-4655
 the Progress Company   14 Oak Park, Bedford, MA 01730
 ---
 
 





Re: Is there Unicode mail out there?

2001-07-13 Thread DougEwell2

In a message dated 2001-07-12 8:55:07 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  So the proposal is that minimizing the charset is a good thing?

  This means that you and I start out in a conversation about a
  product I am trying to sell you, it happens to be all in ascii
  and we exchange several mails successfully. Then I quote you
  a price in Euros and my 1252 message gets corrupted by your
  reader which can handle either only 8859-1 or ASCII, and
  you miss the fact that the Euro is corrupted and think we
  are talking dollars or some other currency.

  Although I understand why you would want a minimal charset in order
  to not needlessly prevent communications, the implication of
  reliability and trust that is built by having some success is
  a problem. You think you are communicating successfully but when it
  is critical it may not...

The premise seems to be that we should reject, or at least issue a warning 
against, the earlier messages on the basis that the sender *might* be able to 
send characters in the future that the receiver could not receive.  Sorry, 
but I can't buy into that.  That would prevent the CP1252 user from ever 
being able to communicate adequately with anyone who has only ISO 8859-1.

What if I am trying to exchange mail with a user of Windows-1256?  Lots of 
roadblocks would be erected because of the chance that the guy *might* send 
me ARABIC LETTER ALEF WITH HAMZA BELOW and I couldn't interpret it.  And I 
couldn't exchange mail with UTF-8 users either, because of that YI SYLLABLE 
BBOP they might send me some day.

  Perhaps if a harder line was taken when characters
  are used that cannot be converted, this would make more sense.
  (ie give a very clear recognizable indication of corruption or
  conversion failures)

That's reasonable.  Simply replacing unknown characters with '?' doesn't 
work; the character is too easily overlooked.  I would like to see mailers 
replace unsupported characters with a Unicode representation like [U+A068]. 
 (That would certainly help with this spate of CJK characters that people are 
sending lately on the Unicode list!)  I suspect that's too much Unicode 
awareness to ask of an otherwise Unicode-unaware product, though.
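
The [U+A068] idea is easy to prototype where a conversion library is available. The sketch below registers a custom Python codec error handler that substitutes the bracketed code point for any character the target charset cannot represent. The handler name "uplus" is invented for this example, and the sketch covers the converting (encoding) side only, not what a display-only mailer would have to do.

```python
import codecs


def uplus_replace(error):
    """Codec error handler: replace unencodable characters with [U+XXXX]."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(f"[U+{ord(ch):04X}]" for ch in bad)
    # Return the replacement text and the position at which to resume encoding.
    return replacement, error.end


codecs.register_error("uplus", uplus_replace)

# The euro sign and YI SYLLABLE BBOP are both missing from ISO 8859-1.
converted = "price: \u20ac100 \ua068".encode("iso-8859-1", errors="uplus")
print(converted.decode("iso-8859-1"))
```

Unlike a bare '?', the bracketed form is hard to overlook and tells the reader exactly which character was lost.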

-Doug Ewell
 Fullerton, California




Re: Is there Unicode mail out there?

2001-07-13 Thread てんどうりゅうじ



　★じゅういっちゃん★

　私はろこえんらかべさ。



On 2001-07-13 at 2:53 EDT, Doug Ewell wrote:
 Simply replacing unknown characters with '?' doesn't work; the
 character is too easily overlooked.  I would like to see mailers
 replace unsupported characters with a Unicode representation
 like "[U+A068]".

For "ordinary users", i. e., those users who don't have the TUS 3.0
tome lying next to their computers, a "last resort glyph" would
probably be more helpful, cf. http://crl.nmsu.edu/~mleisher/lr.html
and http://fonts.apple.com/LastResort/LastResort.html.

Best wishes,
  Otto Stolz



They can look it up online.

Yes, it is a tome
Not just a book, a TOME.


Re: Is there Unicode mail out there?

2001-07-13 Thread Tex Texin

Doug,
I thought I had acknowledged the rationale for labeling the message
with the minimal charset based on each message's contents in the
beginning of the third paragraph, but maybe I should have expanded on
it. Anyway, despite the benefit, it is a significant problem that it is
unreliable and that past performance does not predict future
performance, or whatever the phrase is that the financial markets use.

I was mostly stage setting for the idea that there should be a
clear indicator for a failed character conversion. The last resort
proposal is ok. I agree with you about seeing the hex value for the
missing character with the symbol. (I've already been forced to learn
the Unicode code point for the Euro by heart... I would probably
recognize most of the commonly failed characters if the code points
were available.) Maybe writing the value as an HTML numeric character
reference (e.g. &#X20AC;) would also make it easier for processes
reading files saved by the mailer to recover the character. (By using
a standard representation and also one that is not likely to appear in
an email, unless the email is about character references...)
For the Unicode-unaware the syntax could allow inclusion of the
original code page label: &#X0080:windows1256;
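
For what it is worth, the numeric-character-reference fallback already exists in some conversion libraries. A minimal Python sketch, assuming the standard "xmlcharrefreplace" error handler (which emits decimal rather than hex references) and html.unescape for the recovery step:

```python
import html

original = "Price: \u20ac100"

# Characters missing from ISO 8859-1 are written as decimal references.
wire = original.encode("iso-8859-1", errors="xmlcharrefreplace")
print(wire)

# A process reading the saved file can turn the references back into characters.
recovered = html.unescape(wire.decode("iso-8859-1"))
print(recovered == original)

# Hex references with an upper-case X, as written above, are understood too.
print(html.unescape("&#X20AC;") == "\u20ac")
```

One caveat with this recovery step: html.unescape also expands anything else that looks like a character reference, so a message that is itself about character references would be altered on the way back.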

Anyway, this problem that characters that do not convert in mails
are not being clearly indicated:
occurs frequently, 
can have significant impact to users,
seems to have some cheap workarounds,
that are better than either just relabeling to the lowest common
denominator or 
preventing communications entirely.

tex




[EMAIL PROTECTED] wrote:
 
 In a message dated 2001-07-12 8:55:07 Pacific Daylight Time,
 [EMAIL PROTECTED] writes:
 
   So the proposal is that minimizing the charset is a good thing?
 
   This means that you and I start out in a conversation about a
   product I am trying to sell you, it happens to be all in ascii
   and we exchange several mails successfully. Then I quote you
   a price in Euros and my 1252 message gets corrupted by your
   reader which can handle either only 8859-1 or ASCII, and
   you miss the fact that the Euro is corrupted and think we
   are talking dollars or some other currency.
 
   Although I understand why you would want a minimal charset in order
   to not needlessly prevent communications, the implication of
   reliability and trust that is built by having some success is
   a problem. You think you are communicating successfully but when it
   is critical it may not...
 
 The premise seems to be that we should reject, or at least issue a warning
 against, the earlier messages on the basis that the sender *might* be able to
 send characters in the future that the receiver could not receive.  Sorry,
 but I can't buy into that.  That would prevent the CP1252 user from ever
 being able to communicate adequately with anyone who has only ISO 8859-1.
 
 What if I am trying to exchange mail with a user of Windows-1256?  Lots of
 roadblocks would be erected because of the chance that the guy *might* send
 me ARABIC LETTER ALEF WITH HAMZA BELOW and I couldn't interpret it.  And I
 couldn't exchange mail with UTF-8 users either, because of that YI SYLLABLE
 BBOP they might send me some day.
 
   Perhaps if a harder line was taken when characters
   are used that cannot be converted, this would make more sense.
   (ie give a very clear recognizable indication of corruption or
   conversion failures)
 
 That's reasonable.  Simply replacing unknown characters with '?' doesn't
 work; the character is too easily overlooked.  I would like to see mailers
 replace unsupported characters with a Unicode representation like [U+A068].
  (That would certainly help with this spate of CJK characters that people are
 sending lately on the Unicode list!)  I suspect that's too much Unicode
 awareness to ask of an otherwise Unicode-unaware product, though.
 
 -Doug Ewell
  Fullerton, California

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---




Re: Is there Unicode mail out there?

2001-07-13 Thread DougEwell2

In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:

  @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš

  @Ž„‚͂낱‚¦‚ñ‚ç‚©‚ׂ³B

Robert, please stop this.  It doesn't seem to be UTF-8 (that is, I can't copy 
and paste it into UniPad or Windows 2000 Notepad and see anything 
reasonable), and even if it were, neither I nor many other list members can 
read Japanese.  We had this discussion earlier in the year about English vs. 
French, and other than exceptions like Patrick Andries' message (which was 
explicitly about a French translation), this is basically an English-language 
list.  It is certainly cool to ask questions about this or that Japanese 
character, but simply posting an unreadable Japanese response to my 
English-language message makes no sense.

-Doug Ewell
 Fullerton, California




Re: Is there Unicode mail out there?

2001-07-13 Thread DougEwell2

In a message dated 2001-07-13 4:06:39 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  For ordinary users, i. e., those users who don't have the TUS 3.0
  tome lying next to their computers, a last resort glyph would
  probably be more helpful, cf. http://crl.nmsu.edu/~mleisher/lr.html
  and http://fonts.apple.com/LastResort/LastResort.html.

Unfortunately, the Windows world has no concept of a Last Resort font.  It 
would certainly seem to be a useful solution in cases like this.

-Doug Ewell
 Fullerton, California




RE: Is there Unicode mail out there?

2001-07-13 Thread Ayers, Mike


 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 

 In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, 
 [EMAIL PROTECTED] 
 writes:
 
   @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš
 
   @Ž„‚͂낱‚¦‚ñ‚ç‚©‚ׂ³B
 
 Robert, please stop this.  It doesn't seem to be UTF-8 (that 
 is, I can't copy 
 and paste it into UniPad or Windows 2000 Notepad and see anything 

It's ISO-2022-JP, if that helps.
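
For those who have not looked at ISO-2022-JP on the wire, here is a small Python sketch of what a short hiragana string looks like when encoded, and what is left when the ESC bytes are dropped somewhere along the way. The sample string is just an example, and the "stripped" step only imitates one kind of damage; it is not a claim about what any particular gateway did.

```python
text = "じゅういっちゃん"

# ESC $ B switches to JIS X 0208; ESC ( B switches back to ASCII at the end.
wire = text.encode("iso2022_jp")
print(wire)

# If the ESC (0x1B) bytes are lost, only printable ASCII remains.
stripped = wire.replace(b"\x1b", b"").decode("ascii")
print(stripped)

# With the escape sequences intact, decoding recovers the original text.
print(wire.decode("iso2022_jp"))
```

Every byte in the encoded form is 7-bit, so a reader that ignores the charset label shows ASCII punctuation and letters rather than accented Latin characters.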

 character, but simply posting an unreadable Japanese response to my 
 English-language message makes no sense.

Ever think that maybe that's why he does it?  Anyway, here's a hint.
As someone who can read a little Japanese, I have never translated anything
in one of 11DB's messages that really mattered.  Anything that he wants us
to see is put in English, so you can probably safely ignore the question
marks.

On the other hand...

Yo, 11DB, get with the program!  Use UTF-8: where do ya think ya
are?  The point is to confuse people, not frustrate them.   ;-)


/|/|ike




Re: Is there Unicode mail out there?

2001-07-13 Thread Rick McGowan

Doug Ewell wrote...

   @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš
   @Ž„‚͂낱‚¦‚ñ‚ç‚©‚ׂ³B

 Robert, please stop this.  It doesn't seem to be UTF-8 (that is, I can't copy  
 and paste it into UniPad or Windows 2000 Notepad and see anything
 reasonable)

Eeek.. What's that?  11's comment shows up fine in my mail reader here, as
Japanese chars.  But what I got was, I believe, "watashi wa rokoenrakabesa",
which isn't any Japanese that I can parse, and it should have a comma after
"wa" in any case.  "Roko" isn't a word, though "rouko" and "roukou" are (and
don't make sense here).  "Besa" isn't a verb ending, even in classical
Japanese, and I can't imagine what it's supposed to mean.  "Enraka" isn't a
word, and "koen" isn't a word though "kouen" is...  Hm.  It's gibberish
anyway, so it wouldn't matter if it came through.

Just looks like nearly random syllables generated by someone who doesn't  
write the language.

Rick




FW: Re: Is there Unicode mail out there?

2001-07-13 Thread てんどうりゅうじ

Those are MOJIBAKE for my SIG.

1) I think that is mojibake for my name. It looks familiar.

2) The second one reads, if I rightly remember, "Watashi wa loco en la cabeza".


If I get a mojibakus or two in a Chinese sig, I don't say anything. (Is mojibakus the 
singular of mojibake? Perhaps "mojibakum"?)


じゅういっちゃん

--- Original Message ---
From: [EMAIL PROTECTED];
To: [EMAIL PROTECTED];
Cc: [EMAIL PROTECTED];
Date: 01/07/13 15:29
Subject: Re: Is there Unicode mail out there?

In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:

  $B%D!!%D!&!#c`TD%+c`TE!Wc`TD!"c`TD!Vc`TE"d?TD%=c`TE!#c`TE%"%D!&!#(B

  $B%D!!%J%9c`\d?TE:d?TE%)c`TD%"c`TD%rc`TE%"c`TE%!c`TD%%c`TENd?TD%&%D!#(B

Robert, please stop this.  It doesn't seem to be UTF-8 (that is, I can't copy 
and paste it into UniPad or Windows 2000 Notepad and see anything 
reasonable), and even if it were, neither I nor many other list members can 
read Japanese.  We had this discussion earlier in the year about English vs. 
French, and other than exceptions like Patrick Andries' message (which was 
explicitly about a French translation), this is basically an English-language 
list.  It is certainly cool to ask questions about this or that Japanese 
character, but simply posting an unreadable Japanese response to my 
English-language message makes no sense.

-Doug Ewell
 Fullerton, California




Re: FW: Re: Is there Unicode mail out there?

2001-07-13 Thread Rick McGowan

 Watashi wa loco en la cabeza

Duh, well, use katakana as appropriate, use middle-dots between your foreign  
words, and people might get it.

Rick




RE: Re: Is there Unicode mail out there?

2001-07-13 Thread Ayers, Mike


 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 

 Those are MOJIBAKE for my SIG.

Which is what you deserve for not sending UTF-8.  Until you upgrade
your mailer, your name will be @š‚¶‚イ‚¢‚Á‚¿‚á‚ñš.   
 :-p

 1) I think that is mojibake for my name. It looks familiar.

See above.

 2) The second one reads, if I rightly remember, Watashi wa 
 loco en la cabeza.

Yep: 私はろこえんらかべさ。  You're still hung up on this "use kana to
represent any language" thing, huh?  You've got that in common with the Japanese - I
was quite surprised to find that most Japanese don't know that their
katakana versions of English words don't sound much like English words.
Anyway, if I ever meet a Spanish and Japanese fluent individual, I'll wave
it under their nose to see if they catch it.  They won't, though, since
you're using hiragana instead of katakana.

Here's some for you to transliterate:

1.)  The bull's nose ring is where we attach the taurine
towline.
2.)  Raul studies lore.
3.)  My file said he was vile.
4.)  Fu did that.  Fu who?

Etc., etc., etc.

 If I get a mojibakus or two in a Chinese sig, I don't say 
 anything. (Is mojibakus the singular of mojibake? Perhaps 
 mojibakum?)

You're the Japanese enthusiast - look it up!


/|/|ike




Re: Is there Unicode mail out there?

2001-07-13 Thread James Kass

Rick McGowan wrote:

 Eeek.. What's that?  11's comment shows up fine
 in my mail reader here, as Japanese chars.  But
 what I got was, I believe, watashi wa rokoenrakabesa
 which isn't any Japanese that I can parse, and it
 should have a comma after wa in any case.
 Roko isn't a word, though rouko and roukou
 are (and don't make sense here).  Besa isn't a verb
 ending, even in classical Japanese, and I can't imagine
 what it's supposed to mean.  Enraka isn't a word,
 and koen isn't a word though kouen is...  Hm.
 It's gibberish anyway, so it wouldn't matter if it
 came through.

How's your Spanish, Rick?

Try watashi wa as Japanese and roko en ra kabesa
as Spanish...  (keeping in mind that Japanese doesn't
distinguish between r and l, of course.)

Best regards,

James Kass.
​






Re: Is there Unicode mail out there?

2001-07-13 Thread Gaute B Strokkenes

On Fri, 13 Jul 2001, [EMAIL PROTECTED] wrote:
 
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
 
 Those are MOJIBAKE for my SIG.
 
   Which is what you deserve for not sending UTF-8.  Until you
 upgrade your mailer, your name will be 
?@?š‚¶‚イ‚¢‚Á‚¿‚á‚ñ?š.  :-p

No way.  Any mail client that is sufficiently clever to understand
UTF-8 should understand all valid and registered MIME-charsets.  After
all, conversion libraries are both widely available and easy to use.
[I can see you put a smiley after your statement so I realise you were
probably being sarcastic, but I thought that this could bear pointing
out.]
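
As a rough illustration of how little a client has to do once a conversion library is available, here is a minimal Python sketch. The helper name `decode_body` and the sample byte strings are made up for this example, and the codec names are simply the ones Python's standard library ships; no particular mail client is claimed to work this way.

```python
def decode_body(raw: bytes, charset: str) -> str:
    """Decode a message body according to its declared MIME charset."""
    try:
        return raw.decode(charset)
    except LookupError:
        # Unknown charset label: fall back to a lossy but safe decoding.
        return raw.decode("ascii", errors="replace")


# Hypothetical sample bodies in three different charsets.
samples = [
    (b"\x1b$B$F$s$I$&$j$e$&$8\x1b(B", "iso-2022-jp"),
    (b"\xc4\xc1", "koi8-r"),
    (b"\xe3\x81\xa6", "utf-8"),
]
decoded = [decode_body(raw, cs) for raw, cs in samples]
for text in decoded:
    print(text)
```

The same one-line `decode` call covers all three bodies; only the label changes.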

All the `all messages should be in UTF-8, even when there are
well-established legacy encodings that cover the characters of a given
message' mumbo-jumbo that has been mentioned recently on the list is
really just so much hot air.  Firstly, mail clients will not be able
to deprecate support for other charsets even if UTF-8 is widely
adopted (which it isn't--for email) because of the need to be able to
interpret the masses of existing messages.  Secondly, maintaining such
support is, as pointed out above, extremely easy to do.  Thirdly,
there are a great number of clients out there that do not support
UTF-8 and are unlikely to do so in the immediate future, either
because of internal limitations in the software that are hard to
remove or because people don't upgrade.  I think it's antisocial to
say `Well, I _could_ have used a charset that would have enabled you
to read my message but I decided not to, for no particularly good
reason.'

On the other hand it makes sense to say `Sorry, but UTF-8 is the only
charset that will do since I wanted to use Etruscan, Russian and
Japanese characters and UTF-8 is the only sane way to do this.'
That's the only benefit that Unicode and UTF-8 will bring to email:
the ability to mix and match characters from all scripts of all sizes
and shapes in a single message.  OTOH, for those of us who need this
it's a big advantage.

Another thing that some people may worry about is the bad interaction
between quoted-unprintable and UTF-8 (or any non-West European / North
American coding in general, but for UTF-8 it's even worse): 6 bytes
for a single Cyrillic character?  Ye gods.  [I could start another
rant about how bad an idea QP was in the first place, but that's
off-topic here.]
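
The six-bytes figure can be checked with a short Python sketch using the standard `quopri` module. The letter chosen and the KOI8-R comparison are assumptions made for the illustration only.

```python
import quopri

letter = "д"  # CYRILLIC SMALL LETTER DE, chosen arbitrarily

# Quoted-printable turns every non-ASCII byte into a three-byte "=XX" escape.
utf8_qp = quopri.encodestring(letter.encode("utf-8"))
koi8_qp = quopri.encodestring(letter.encode("koi8-r"))

print(utf8_qp, len(utf8_qp))  # b'=D0=B4' 6
print(koi8_qp, len(koi8_qp))  # b'=C4' 3
```

UTF-8 needs two bytes for the letter and each one is escaped, hence six bytes on the wire against three for a single-byte Cyrillic charset.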

-- 
Gaute Strokkenes        http://www.srcf.ucam.org/~gs234/
I am NOT a nut




Re: Is there Unicode mail out there?

2001-07-12 Thread DougEwell2

In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  One exception to this should be US-ASCII because not only the repertoire
  of US-ASCII is a subset of the repertoire of UTF-8 but also the
  representation of all characters in US-ASCII is identical in UTF-8.
  A smart mail client would notice that all characters
  are in US-ASCII repertoire  and label outgoing messages as in
  US-ASCII EVEN if it's configured to label outgoing messages
  in UTF-8
[...]

I thought this might even be enshrined in an RFC.  It certainly makes sense.  
If you are using a mailer that sends CP1252 down the wire (not that this is a 
good idea, but some mailers do this), the mailer should examine the message 
and if it only contains US-ASCII characters, the message should be tagged as 
US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as 
ISO 8859-1.  Only if it actually contains CP1252 characters, like smart 
quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed, 
the same goes for UTF-8.
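
A minimal sketch of that labelling rule in Python, assuming the mailer holds the message as Unicode text before sending. The function name `minimal_charset` and the candidate list are invented for this example and do not describe any particular mailer.

```python
# Candidate labels, from smallest repertoire to largest.
CANDIDATES = ["us-ascii", "iso-8859-1", "windows-1252", "utf-8"]


def minimal_charset(text: str) -> str:
    """Return the first candidate charset that can represent all of text."""
    for charset in CANDIDATES:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    # Only reachable for malformed input such as lone surrogates.
    raise ValueError("text cannot be encoded in any candidate charset")


for sample in ["Hello", "café", "\u201cquoted\u201d", "日本語"]:
    print(repr(sample), "->", minimal_charset(sample))
```

Trying the candidates in order of increasing repertoire means a plain English message is tagged US-ASCII even when the user's configured default is UTF-8.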

-Doug Ewell
 Fullerton, California




Re: Is there Unicode mail out there?

2001-07-12 Thread James Kass

Please disregard my previous message about a work-around
for Outlook Express problem.

Although it works, non-UTF-8 messages are no longer being
properly displayed, an unacceptable trade-off.

Another possibility which was tested was to add an innocuous
character which isn't included in any code page to the
signature.  Tried the zero-width space.  When copying the
zero-width space into the signature of a message being sent
in reply to a message encoded as Thai (Windows), Outlook
Express prompted to Send as Unicode... when the letter
was tagged to be sent later.  So far, so good.

Figured it would be possible to set up a signature with
ZWS to eliminate the necessity of manually changing the 
encoding of messages being sent to UTF-8 every time a 
message is sent.  Unfortunately, on Windows M.E., the 
signature  information is stored in the Registry, and it's ASCII.  
So, the ZWS got converted to a question mark and doesn't
get switched back when it's added to a message.

So, tried setting up a signature file to be added to each
outgoing message including the ZWS.  In this case, MSOE
displays the UTF-8 ZWS as mojibake (gibberish) when the
signature is added to the outgoing message.
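
Both failure modes can be reproduced in a few lines of Python. This is only an illustration of the character conversions involved; it assumes nothing about how Outlook Express or the Registry behave internally.

```python
ZWS = "\u200b"  # ZERO WIDTH SPACE

# Stored through an ASCII-only channel, the character degrades to "?".
stored = ZWS.encode("ascii", errors="replace")

# Its UTF-8 bytes, misread as Windows-1252, become three visible characters.
utf8_bytes = ZWS.encode("utf-8")
misread = utf8_bytes.decode("cp1252")

print(stored)            # b'?'
print(utf8_bytes.hex())  # e2808b
print(misread)           # â€‹
```

The middle byte, 0x80, is the Euro sign in Windows-1252, which is why the mojibake for ZWS contains that symbol.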

Perhaps a future version of Outlook will correct the
problem.

Best regards,

James Kass.






Re: Is there Unicode mail out there?

2001-07-12 Thread James

[EMAIL PROTECTED] wrote:
 
 In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
 [EMAIL PROTECTED] writes:
 
   One exception to this should be US-ASCII because not only the repertoire
   of US-ASCII is a subset of the repertoire of UTF-8 but also the
   representation of all characters in US-ASCII is identical in UTF-8.
   A smart mail client would notice that all characters
   are in US-ASCII repertoire  and label outgoing messages as in
   US-ASCII EVEN if it's configured to label outgoing messages
   in UTF-8
 [...]
 
 I thought this might even be enshrined in an RFC.  It certainly makes sense.
 If you are using a mailer that sends CP1252 down the wire (not that this is a
 good idea, but some mailers do this), the mailer should examine the message
 and if it only contains US-ASCII characters, the message should be tagged as
 US-ASCII. 

The RFCs/BCPs do encourage using as minimal a charset as possible.

Anyway, UTF-8 email is nowhere right now. Kat Momoi of Netscape has suggested
that about the only way this could change is if email client vendors turn it
on by default in new product releases. I won't be the first!

Having done a lot of email client programming using the RFCs as a basis,
let me say that in general RFCs are vague, and not always the best practice
for interoperability when it comes to email.

For example, CRLF in message bodies is recommended, but actually reduces
interoperability, particularly with subversions of IE 5. So I don't know
of any email client that does it. And quoted-printable is way too
complicated to expect conforming implementations.

And don't get me started about all the random charsets that RFCs promote that
nobody adopts!

James.




Re: Is there Unicode mail out there?

2001-07-12 Thread James Kass

Here's a work-around that seems to work.

Added the ZWS after the signature in a signature file.
Because the mojibake for ZWS includes the Euro
currency symbol, OE prompts to 'send as Unicode'
when replying to a non-UTF-8 sender.

Of course, the time saved by not having to manually
change the encoding will probably be less than the
time lost explaining what the junk is under my name.

Best regards,

James Kass.
​






Re: Is there Unicode mail out there?

2001-07-12 Thread Jungshik Shin




On Thu, 12 Jul 2001, James Kass wrote:

 Here's a work-around that seems to work.

 Added the ZWS after the signature in a signature file.
 Because the mojibake for ZWS includes the Euro
 currency symbol, OE prompts to 'send as Unicode'
 when replying to a non-UTF-8 sender.

  Mysterious is why this prompting (by MS OE) did not happen to Mike
Ayers when he replied to Peter's message with Thai string in Windows-874
adding some Chinese characters while MS OE (5.50.x) I tried certainly
prompted me to pick one of three (1. send as Unicode, 2. send as is -
in Windows-874 - risking loss of info. 3. cancel) when I did the same
thing. ZWS and Chinese characters have no reason to be treated differently
when added to a Windows-874 encoded message.

  BTW, Mozilla/Netscape 6 also uses the encoding of the message
(or its closest match among IANA-registered MIME charsets. Thus, in place
of Windows-874, Mozilla/Netscape 6 uses TIS-620) you're replying to by
default. When one adds some characters outside the repertoire of that
encoding, it warns that there are some characters not representable in the
current encoding and it's necessary to change the encoding to something
that can represent all characters. (it does not suggest Unicode.) It
offers two options : go ahead despite potential loss of some characters
or cancel and change the encoding.

  Perhaps, both Mozilla/Netscape 6 and MS OE should have an option (
'toggle-switchable') to let users  specify that their preferred encoding
(set in preference) be used by default regardless of the encoding of
messages they're replying to.
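
The behaviour described above, plus the proposed toggle, might be sketched as follows in Python. The function `choose_reply_charset` and its parameters are hypothetical and are not taken from Mozilla or Outlook Express; `cp874` is Python's name for the Windows-874 codec.

```python
def choose_reply_charset(text, original, preferred="utf-8", force_preferred=False):
    """Pick a charset for a reply; returns (charset, needs_prompt)."""
    if force_preferred:
        # The proposed toggle: always use the user's preferred encoding.
        return preferred, False
    try:
        text.encode(original)
        return original, False
    except UnicodeEncodeError:
        # Some characters fall outside the original encoding's repertoire,
        # so the client should ask before switching.
        return preferred, True


print(choose_reply_charset("กลับมา", "cp874"))
print(choose_reply_charset("กลับมา 中文", "cp874"))
print(choose_reply_charset("กลับมา", "cp874", force_preferred=True))
```

With the toggle off, the reply stays in the sender's charset until a character such as a Chinese ideograph makes that impossible.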

   Jungshik Shin





Re: Is there Unicode mail out there?

2001-07-12 Thread Peter_Constable


  Hmm, it didn't work either.
OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485​
E-mail: [EMAIL PROTECTED]




Re: Is there Unicode mail out there?

2001-07-12 Thread Tex Texin

(I didn't read all the thread so maybe I missed a step).

So the proposal is that minimizing the charset is a good thing?

This means that you and I start out in a conversation about a
product I am trying to sell you, it happens to be all in ascii
and we exchange several mails successfully. Then I quote you
a price in Euros and my 1252 message gets corrupted by your
reader which can handle either only 8859-1 or ASCII, and
you miss the fact that the Euro is corrupted and think we
are talking dollars or some other currency.

Although I understand why you would want a minimal charset in order
to not needlessly prevent communications, the implication of
reliability and trust that is built by having some success is
a problem. You think you are communicating successfully but when it
is critical it may not...

Perhaps if a harder line was taken when characters
are used that cannot be converted, this would make more sense.
(ie give a very clear recognizable indication of corruption or
conversion failures)
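
The Euro scenario is easy to reproduce. The Python sketch below shows the silent corruption and one possible way of making a conversion failure visible; the helper `encode_or_flag` is a made-up name for this illustration.

```python
price = "€100"

wire = price.encode("cp1252")        # what a CP1252 mailer puts on the wire
as_latin1 = wire.decode("latin-1")   # what an ISO 8859-1 reader makes of it

# The Euro sign became U+0080, an invisible C1 control character.
print(repr(as_latin1))  # '\x80100'


def encode_or_flag(text: str, charset: str) -> bytes:
    """Encode text, marking every unconvertible character visibly."""
    return text.encode(charset, errors="backslashreplace")


print(encode_or_flag(price, "latin-1"))  # b'\\u20ac100'
```

The first result reads as a bare "100" on most displays, which is exactly the silent failure in question; the second at least leaves an obvious marker where the currency symbol was.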

tex



[EMAIL PROTECTED] wrote:
 
 In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
 [EMAIL PROTECTED] writes:
 
   One exception to this should be US-ASCII because not only the repertoire
   of US-ASCII is a subset of the repertoire of UTF-8 but also the
   representation of all characters in US-ASCII is identical in UTF-8.
   A smart mail client would notice that all characters
   are in US-ASCII repertoire  and label outgoing messages as in
   US-ASCII EVEN if it's configured to label outgoing messages
   in UTF-8
 [...]
 
 I thought this might even be enshrined in an RFC.  It certainly makes sense.
 If you are using a mailer that sends CP1252 down the wire (not that this is a
 good idea, but some mailers do this), the mailer should examine the message
 and if it only contains US-ASCII characters, the message should be tagged as
 US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as
 ISO 8859-1.  Only if it actually contains CP1252 characters, like smart
 quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed,
 the same goes for UTF-8.
 
 -Doug Ewell
  Fullerton, California

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---




Re: Is there Unicode mail out there?

2001-07-12 Thread てんどうりゅうじ
My other e-mail was a real "moji-baka", I'd say. That would be a good term, 
$BJ8;zGO(B: Re: Is there Unicode mail out there?

(I didn't read all the thread so maybe I missed a step).

So the proposal is that minimizing the charset is a good thing?

This means that you and I start out in a conversation about a
product I am trying to sell you, it happens to be all in ascii
and we exchange several mails successfully. Then I quote you
a price in Euros and my 1252 message gets corrupted by your
reader which can handle either only 8859-1 or ASCII, and
you miss the fact that the Euro is corrupted and think we
are talking dollars or some other currency.

Although I understand why you would want a minimal charset in order
to not needlessly prevent communications, the implication of
reliability and trust that is built by having some success is
a problem. You think you are communicating successfully but when it
is critical it may not...

Perhaps if a harder line was taken when characters
are used that cannot be converted, this would make more sense.
(ie give a very clear recognizable indication of corruption or
conversion failures)

tex



[EMAIL PROTECTED] wrote:
 
 In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
 [EMAIL PROTECTED] writes:
 
   One exception to this should be US-ASCII because not only the repertoire
   of US-ASCII is a subset of the repertoire of UTF-8 but also the
   representation of all characters in US-ASCII is identical in UTF-8.
   A smart mail client would notice that all characters
   are in US-ASCII repertoire  and label outgoing messages as in
   US-ASCII EVEN if it's configured to label outgoing messages
   in UTF-8
 [...]
 
 I thought this might even be enshrined in an RFC.  It certainly makes sense.
 If you are using a mailer that sends CP1252 down the wire (not that this is a
 good idea, but some mailers do this), the mailer should examine the message
 and if it only contains US-ASCII characters, the message should be tagged as
 US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as
 ISO 8859-1.  Only if it actually contains CP1252 characters, like smart
 quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed,
 the same goes for UTF-8.
 
 -Doug Ewell
  Fullerton, California

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---




Re: Is there Unicode mail out there?

2001-07-12 Thread Jungshik Shin


On Thu, 12 Jul 2001 [EMAIL PROTECTED] wrote:

   Hmm, it didn't work either.
 OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว

   Finally, you succeeded ! Congratulations :-). Could you
explain what you did differently this time so that other Lotus
Notes users can benefit from your experience/experiment?

  Jungshik Shin





RE: Is there Unicode mail out there?

2001-07-12 Thread Ayers, Mike


 From: Jungshik Shin [mailto:[EMAIL PROTECTED]] 

   Mysterious is why this prompting (by MS OE) did not happen to Mike
 Ayers when he replied to Peter's message with Thai string in Windows-874
 adding some Chinese characters while MS OE (5.50.x) I tried certainly
 prompted me to pick one of three (1. send as Unicode, 2. send as is -
 in Windows-874 - risking loss of info. 3. cancel) when I did the same
 thing. ZWS and Chinese characters have no reason to be treated differently
 when added to a Windows-874 encoded message.

Not mysterious really, I'm using Outlook, not Outlook Express.
Despite the similarity of names, the differences seem to be considerable.
It is disturbing, though, that the premium product has less desirable
behavior than the free one in this case.


/|/|ike




Re: Is there Unicode mail out there?

2001-07-12 Thread James Kass


Jungshik Shin wrote:

   Perhaps, both Mozilla/Netscape 6 and MS OE should have an option (
 'toggle-switchable') to let users  specify that their preferred encoding
 (set in preference) be used by default regardless of the encoding of
 messages they're replying to.


It would be nice...

MS OE appeared to already have the option.  Under Tools-Options-
Send, there's a check-box for Reply to messages using the format
in which they were sent.  Under Tools-Options-Send-International
Settings, there's a provision for the user to choose a default
encoding and a check-box to Use the following default encoding for
outgoing messages:.  Even though this system was set up
accordingly, outgoing messages which were replies to messages
in non-UTF-8 encodings weren't being sent in UTF-8, to my
surprise, chagrin, and dismay.

Best regards,

James Kass.
​





RE: Is there Unicode mail out there?

2001-07-12 Thread Chris Wendt

In any case, no matter if new message or reply or forward, you can force
OE to use a specific encoding using the Format.Encoding menu. There is
no option to ALWAYS use a specific encoding in replies and forwards, you
will have to choose manually each time. OE itself has no option to
automatically determine the best outbound encoding (and I agree that
generally the encoding with the smallest repertoire is the best). OE
will only suggest UTF-8 and will not suggest any other charset, if the
chosen encoding does not hold the characters used.

Note: an HTML message to an HTML4 capable recipient will transport any
character regardless of the chosen encoding. That might explain the
different results you are seeing when sending to differently enabled
recipients.
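
The mechanism behind this is the HTML numeric character reference. A small Python sketch, with a sample string invented for the purpose, shows how any character survives an ASCII-only encoding that way:

```python
import html

text = "Price: €100, Thai: กลับ"

# Characters outside US-ASCII become numeric character references.
ascii_html = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_html)

# An HTML-aware reader resolves the references back to the characters.
restored = html.unescape(ascii_html.decode("ascii"))
print(restored == text)
```

A plain-text message has no such escape mechanism, so there the chosen charset decides what can be sent.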

Replying in the charset of the original message is in my view reasonable
behavior: the recipient of your reply has the best chance to read the
message in the encoding the original message was sent. Changing the
encoding decreases the chance the replyee will be able to read your
message.


-Original Message-
From: James Kass [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, July 12, 2001 1:18 PM
To: Jungshik Shin
Cc: Unicode List
Subject: Re: Is there Unicode mail out there?



Jungshik Shin wrote:

   Perhaps, both Mozilla/Netscape 6 and MS OE should have an option (
 'toggle-switchable') to let users  specify that their preferred 
 encoding (set in preference) be used by default regardless of the 
 encoding of messages they're replying to.


It would be nice...

MS OE appeared to already have the option.  Under Tools-Options- Send,
there's a check-box for Reply to messages using the format in which
they were sent.  Under Tools-Options-Send-International Settings,
there's a provision for the user to choose a default encoding and a
check-box to Use the following default encoding for outgoing
messages:.  Even though this system was set up accordingly, outgoing
messages which were replies to messages in non-UTF-8 encodings weren't
being sent in UTF-8, to my surprise, chagrin, and dismay.

Best regards,

James Kass.
​






RE: Is there Unicode mail out there?

2001-07-12 Thread Ayers, Mike


 From: Chris Wendt [mailto:[EMAIL PROTECTED]] 

 Replying in the charset of the original message is in my view 
 reasonable
 behavior: the recipient of your reply has the best chance to read the
 message in the encoding the original message was sent. Changing the
 encoding decreases the chance the replyee will be able to read your
 message.

For person-to-person emails, this makes sense.  It does not hold up
for mailing lists, however - it's not necessarily unreasonable behavior, but
the odds of readability for mailing lists are fixed to the character set,
regardless of the character set used in any individual mailing (note that
the Windows Thai character set could not be viewed by many people - changed
to UTF-8, almost everyone could read it).  For this reason, I would really
like to see option controlled behavior (use the current behavior as a
default).


/|/|ike




Re: Is there Unicode mail out there?

2001-07-12 Thread James Kass

Chris Wendt wrote:

 Replying in the charset of the original message
 is in my view reasonable behavior: the recipient
 of your reply has the best chance to read the
 message in the encoding the original message
 was sent. Changing the encoding decreases the
 chance the replyee will be able to read your
 message.

When a user issues an instruction to a computer, it
is a command rather than a request.  If a user selects
the option to Use the following default encoding for
outgoing messages:, then the expected behavior is
compliance.

Of course, you are quite right in that the recipient
is more likely to be able to read a message sent in the
recipient's default.  As we move towards a World encoding
standard, perhaps more applications will use the standard
as default.

This message is being sent in Arabic (Windows) because
it is in response to a message sent in that encoding.  The
author of the original message has noted my work-around
and has cleverly prevented it by selecting a code-page
which includes the special character I'm using for the
kludge.

Best regards,

James Kass.
​






Re: Is there Unicode mail out there?

2001-07-11 Thread Otto Stolz

[EMAIL PROTECTED] had asked:
 Is there Unicode mail out there?

On Sun, 8 Jul 2001 03:40:51 -0700 James Kass wrote:
 Microsoft's Outlook Express offers many e-mail encoding
 options, including Unicode (UTF-8) and responding to the
 sender in the same encoding as the sender's message.  And,
 it won't cost you money.

Same with Netscape 6.01; though it still has some teething problems.

Best wishes,
  Otto Stolz




Re: Is there Unicode mail out there?

2001-07-11 Thread dabhijit

Can you read this? This is coming from Lotus Notes.




Otto Stolz [EMAIL PROTECTED] on 07/11/2001 10:43:10 PM

Please respond to Otto Stolz [EMAIL PROTECTED]

To:   Unicode List [EMAIL PROTECTED]
cc:   11 [EMAIL PROTECTED] (bcc: Dutta Abhijit/India/IBM)
Subject:  Re: Is there Unicode mail out there?




[EMAIL PROTECTED] had asked:
 Is there Unicode mail out there?

On Sun, 8 Jul 2001 03:40:51 -0700 James Kass wrote:
 Microsoft's Outlook Express offers many e-mail encoding
 options, including Unicode (UTF-8) and responding to the
 sender in the same encoding as the sender's message.  And,
 it won't cost you money.

Same with Netscape 6.01; though it still has some teething problems.

Best wishes,
  Otto Stolz








Re: Is there Unicode mail out there?

2001-07-11 Thread Michael \(michka\) Kaplan

From: [EMAIL PROTECTED]

 Can you read this? This is coming from Lotus Notes.
 
 

Yes, it looks like you are confused (all those question marks!)

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/







Re: Is there Unicode mail out there?

2001-07-11 Thread Otto Stolz

On Wed, 11 Jul 2001 15:41:28 +0530, [EMAIL PROTECTED] wrote:
 Can you read this?
 

It's 8 question marks in a row. I don't know what you had expected.

Note that you have sent these header fields:
 Mime-Version: 1.0
 Content-type: text/plain; charset=us-ascii

This announces 7-bit ASCII, cf. http://czyborra.com/charsets/iso646.html,
so your message cannot contain any other character.
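
For anyone who wants to inspect such headers programmatically, here is a minimal Python sketch using the standard `email` package. The raw message is a reconstruction for illustration, not the actual mail that was sent.

```python
from email import message_from_string

# Reconstructed sample: only the two quoted header fields and a short body.
raw = (
    "Mime-Version: 1.0\n"
    "Content-type: text/plain; charset=us-ascii\n"
    "\n"
    "Can you read this?\n"
    "????????\n"
)

msg = message_from_string(raw)
charset = msg.get_content_charset()
body = msg.get_payload()

# Collect anything in the body that the declared charset cannot represent.
non_ascii = [ch for ch in body if ord(ch) > 127]

print(charset)    # us-ascii
print(non_ascii)  # []
```

An empty list means the body is consistent with its label: whatever the sender typed had already been replaced by question marks before the message went out.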

 This is coming from Lotus Notes.

I haven't tried Lotus Notes, so I cannot tell you how to persuade it
to send non-ASCII text (with proper headers, of course), or whether
this is possible at all.

Best wishes,
  Otto Stolz




Re: Is there Unicode mail out there?

2001-07-11 Thread James Kass

Michael Kaplan wrote:

 From: [EMAIL PROTECTED]
 
  Can you read this? This is coming from Lotus Notes.
  
  
 
 Yes, it looks like you are confused (all those question marks!)
 

Maybe it's the Lotus Notes that's confused rather than 
Dutta Abhijit?

Best regards,

James Kass.






Re: Is there Unicode mail out there?

2001-07-11 Thread DougEwell2

In a message dated 2001-07-11 3:26:25 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  Can you read this? This is coming from Lotus Notes.

  

'Nuff said.

I received this with CompuServe 5.0 (similar to AOL 5.0, imagine that).  I 
don't know if CompuServe 6.0 is any better.

-Doug Ewell
 Fullerton, California




Re: Is there Unicode mail out there?

2001-07-11 Thread Peter_Constable


Can you read this? This is coming from Lotus Notes.



Notes can handle Unicode characters, at least going from one Notes user to
another within our system. Once it goes out on to the Internet, there may
be other processes intervening that munge the data.

In the Basics tab of the User Preferences dialog (I'm using R5.0.5), under
Additional Options I've got Enable Unicode display enabled; under the
Main and News tab, in the Multilingual Internet Mail drop down, select Use
Unicode (UTF-8). Just as a test, here's a bit of Thai: กลัปมาอยู่แล้ว


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]




RE: Is there Unicode mail out there?

2001-07-11 Thread Addison Phillips [wM]

After all the various replies that say gosh, I can't read this, I thought
it might be helpful to point out this section of Abhijit's email headers:

Content-type: text/plain; charset=us-ascii

The outbound mailer (even in Notes, which is a pretty well internationalized
application, although they bury the settings that control this specific
capability!!) can send UTF-8, as far as I remember, plus a raft of legacy
encodings. In this case either the user's mail client or the mailer itself
is set to send US-ASCII. Since I don't have Notes installed these days, I
can't say where the controls are that change the settings (I certainly don't
remember), but I do recall that I was able, as a Notes user in the past, to
set my encoding. That would quite possibly make the string of eight unknown
characters visible to the list. Note that this has nothing to do with which
mailer you are receiving the mail with or with Sarasvati's capabilities or
anything: the message was converted to nothing before it left the sender.

In most cases in my recent experience, settings on the mailer or mail client
itself prevent a proper Unicode message from being generated. The mailers
themselves rarely care about the encoding: as long as it obeys RFCs
822/1341/1342 they are happy. Most of the more modern GUI mail clients can
handle UTF-8. Yes, there are older or text-mode clients that can't deal with
it, but in my experience it is getting to the point that there are
(generally, generally) more problems with getting the settings set to send
than with receivers receiving!

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED]
 Sent: Wednesday, July 11, 2001 3:11 AM
 To: Otto Stolz
 Cc: Unicode List
 Subject: Re: Is there Unicode mail out there?


 Can you read this? This is coming from Lotus Notes.

 


 Otto Stolz [EMAIL PROTECTED] on 07/11/2001 10:43:10 PM

 Please respond to Otto Stolz [EMAIL PROTECTED]

 To:   Unicode List [EMAIL PROTECTED]
 cc:   11 [EMAIL PROTECTED] (bcc: Dutta Abhijit/India/IBM)
 Subject:  Re: Is there Unicode mail out there?




 [EMAIL PROTECTED] had asked:
  Is there Unicode mail out there?

 On Sun, 8 Jul 2001 03:40:51 -0700 James Kass wrote:
  Microsoft's Outlook Express offers many e-mail encoding
  options, including Unicode (UTF-8) and responding to the
  sender in the same encoding as the sender's message.  And,
  it won't cost you money.

 Same with Netscape 6.01; though it still has some teething problems.

 Best wishes,
   Otto Stolz











Re: Is there Unicode mail out there?

2001-07-11 Thread Mark Davis

Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว

Mark
- Original Message -
From: [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Wednesday, July 11, 2001 09:33
Subject: Re: Is there Unicode mail out there?



 Can you read this? This is coming from Lotus Notes.
 
 

 Notes can handle Unicode characters, at least going from one Notes user to
 another within our system. Once it goes out on to the Internet, there may
 be other processes intervening that munge the data.

 In the Basics tab of the User Preferences dialog (I'm using R5.0.5), under
 Additional Options I've got Enable Unicode display enabled; under the
 Main and News tab, in the Multilingual Internet Mail drop down, select Use
 Unicode (UTF-8). Just as a test, here's a bit of Thai: กลัปมาอยู่แล้ว


 - Peter


 ---
 Peter Constable

 Non-Roman Script Initiative, SIL International
 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
 Tel: +1 972 708 7485
 E-mail: [EMAIL PROTECTED]







RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike


 From: Mark Davis [mailto:[EMAIL PROTECTED]] 
 
 Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว
 

Woohoo!!! UTF-8 party!!!  ???!!!


/|/|ike




Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote:

 In the Basics tab of the User Preferences dialog (I'm using R5.0.5), under
 Additional Options I've got Enable Unicode display enabled; under the
 Main and News tab, in the Multilingual Internet Mail drop down, select Use
 Unicode (UTF-8). Just as a test, here's a bit of Thai: กลัปมาอยู่แล้ว

  Your mail has the following header, which indicates that
it's in 'Windows-874' encoding. I'm not sure whether that encoding name
is registered with IANA for use in MIME.

 X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
 MIME-Version: 1.0
 Content-type: text/plain; charset=Windows-874
^^^
  Anyway, to get your message properly recognized as in UTF-8
by other MIME-compliant mail programs (MS OE and Netscape 6.x/Mozilla,
Pine, Mutt, etc), you have to find a way to make Lotus Notes
add the correct MIME header for UTF-8 message as shown below:

  Content-type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: (8bit|base64|quoted-printable)

  I'm not sure if that's possible in Lotus Notes, though. MS OE and
Netscape 6.x/Mozilla, Mutt and Pine work well with UTF-8 messages (for
the latter two, obviously you need to have a terminal to support UTF-8)
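For anyone generating mail programmatically rather than through a GUI client, here is a minimal sketch of producing the two headers above with Python's standard email package. It says nothing about what Lotus Notes can do; the function name and the subject line are made up for illustration.

```python
from email.message import EmailMessage


def build_utf8_message(body: str) -> EmailMessage:
    """Return a text/plain message whose body is labelled and encoded as UTF-8."""
    msg = EmailMessage()
    msg["Subject"] = "Thai test"  # hypothetical subject, for illustration only
    # charset= sets the Content-Type parameter; cte= picks the transfer
    # encoding (base64 here, one of the three choices listed above).
    msg.set_content(body, charset="utf-8", cte="base64")
    return msg


msg = build_utf8_message("กลัปมาอยู่แล้ว")
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
```

A receiving client that honours the charset parameter can then decode the body without guessing.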

  Jungshik Shin





Re: Is there Unicode mail out there?

2001-07-11 Thread Peter_Constable


Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว

And my own message came back to me with the Thai as I originally sent it.
So, I'm getting UTF-8 going out and coming in with nothing messing it up in
between. If other Notes users aren't getting the same results, check the
version of your client (I don't know if R4.x could handle Unicode or not),
and check your preferences.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]




Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin




On Wed, 11 Jul 2001, Mark Davis wrote:

 - Original Message -
 From: [EMAIL PROTECTED]
 Sent: Wednesday, July 11, 2001 09:33

  Main and News tab, in the Multilingual Internet Mail drop down, select
 Use
  Unicode (UTF-8). Just as a test, here's a bit of Thai: 
กลัปมาอยู่แล้ว

 Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว

  Well, it was not in UTF-8, though. It was encoded in Windows-874
(for Thai) and was flagged as such in Content-Type header of the message.

   Content-Type: text/plain; charset=Windows-874

  In my previous response, I thought actual encoding used was UTF-8, but
Lotus Notes  put the incorrect charset parameter value in C-T header. That
turned out not to be the case. At least, there's NO inconsistency between
what's used in the message body and what the message header indicated
was used in the message body.
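To make the point concrete, here is a small Python sketch (an illustration only, not a trace of any particular mailer) of what happens to Windows-874 bytes when a reader honours the charset label versus when it assumes ISO 8859-1. Python's codec name for Windows-874 is cp874.

```python
thai = "กลัปมาอยู่แล้ว"

# What a Windows-874 mailer puts on the wire: one byte per Thai character.
wire_bytes = thai.encode("cp874")

# A reader that ignores the charset label and assumes ISO 8859-1 sees
# a run of accented Latin letters instead of Thai.
misread = wire_bytes.decode("latin-1")

# A reader that honours charset=Windows-874 gets the Thai back intact.
correct = wire_bytes.decode("cp874")

print(misread)
print(correct == thai)
```

The bytes themselves are consistent with the header; only a reader that disregards the header gets it wrong.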

  I'm writing this email with Pine running inside
UTF-8 enabled xterm with the following line added to display filter
spec. of my pinerc (Pine configuration file)

  _CHARSET(Windows-874)_ /usr/bin/iconv -f CP874 -t UTF-8

   Unlike my previous message (which included a Windows-874-encoded
Thai string but was marked as UTF-8, because I thought that string
was in UTF-8), this message should have its Thai string encoded in UTF-8
(as indicated by the C-T header).
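For those without iconv at hand, the same conversion that the display filter performs can be sketched in a few lines of Python. The function name is hypothetical; the codec names cp874 and utf-8 are the ones Python's standard library provides.

```python
def cp874_to_utf8(raw: bytes) -> bytes:
    """Rough equivalent of `iconv -f CP874 -t UTF-8` for a complete byte string."""
    # Decode the legacy Thai bytes to text, then re-encode the text as UTF-8.
    return raw.decode("cp874").encode("utf-8")


# The Windows-874 bytes of the Thai test string used in this thread.
legacy = "กลัปมาอยู่แล้ว".encode("cp874")
converted = cp874_to_utf8(legacy)
print(len(legacy), len(converted))
```

Each Thai character takes one byte in Windows-874 and three in UTF-8, so the converted body is three times as long.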

   Jungshik Shin





RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike


Okay, I sent these as UTF-8, with some Chinese where the question
marks are.  However, the Chinese is getting eaten somewhere along the way.
Oddly, though, the Thai still displays fine.  Would any Outlook XP guru
volunteer to help me get back to my international ways?

Final test:  


 From: Ayers, Mike [mailto:[EMAIL PROTECTED]] 
 
   Let's try this again...
 
   From: Mark Davis [mailto:[EMAIL PROTECTED]] 
   
   Yes, that works fine. The Thai comes through clearly: 
 กลัปมาอยู่แล้ว
   
 
   Woohoo!!!  UTF-8 party!!!  ???!!!
 
  
  /|/|ike
  
 




RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike


Let's try this again...

  From: Mark Davis [mailto:[EMAIL PROTECTED]] 
  
  Yes, that works fine. The Thai comes through clearly: กลัปมาอยู่แล้ว
  

Woohoo!!!  UTF-8 party!!!  ???!!!

 
 /|/|ike
 




Re: Is there Unicode mail out there?

2001-07-11 Thread Peter_Constable


 Unicode (UTF-8). Just as a test, here's a bit of Thai: กลัปมาอยู่แล้ว
  Your mail has the following header, which indicates that
it's in 'Windows-874' encoding. I'm not sure whether that encoding name
is registered with IANA for use in MIME.

 X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
 MIME-Version: 1.0
 Content-type: text/plain; charset=Windows-874

OK, I didn't look closely at the header, just at the result. Here's another
test that will be telling - I don't know of any codepage / charset for
Ethiopic: ሀሁሂሃ



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]




RE: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001, Ayers, Mike wrote:

   Okay, I sent these as UTF-8, with some Chinese where the question
 marks are.  However, the Chinese is getting eaten somewhere along the way.
 Oddly, though, the Thai still displays fine.  Would any Outlook XP guru
 volunteer to help me get back to my international ways?

   Final test:  

  Nothing cryptic. As with others on this thread, your problem is
that you mistook Windows-874 (the legacy encoding for Thai) for UTF-8. Because
Windows-874 does NOT cover Chinese characters, they turned into
'?'. Judging from your message header, you're not using MS OE
but something different.

 X-Mailer: Internet Mail Service (5.5.2653.19)
 Content-Type: text/plain; charset=windows-874
 Content-Transfer-Encoding: 8bit
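The loss is easy to reproduce. The sketch below is an illustration in Python, not a reconstruction of what Internet Mail Service does internally: it encodes a reply containing Chinese into Windows-874 (codec name cp874 in Python) with lossy replacement, which substitutes '?' for every character outside the repertoire.

```python
reply = "Woohoo!!!  UTF-8 party!!!  你好吗!!!"

# errors="replace" substitutes '?' for each character cp874 cannot represent,
# instead of raising UnicodeEncodeError.
wire_bytes = reply.encode("cp874", errors="replace")

# Decoding with the same charset shows what the recipient ends up with.
print(wire_bytes.decode("cp874"))
```

Once the substitution has happened, the original characters cannot be recovered on the receiving side.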


   MS OE 5.x is smart enough to detect characters in your reply
(in this case, Chinese characters) that are not covered by the repertoire
of the MIME charset of the message you're replying to (in this case,
Windows-874), which by default is also the MIME charset of your reply.
It then explains that some characters are not representable in that
default encoding and will be lost, and prompts you to choose whether
to use UTF-8 instead.
You can also configure MS OE to always use UTF-8 (or whatever
encoding you choose) regardless of the encoding of the
messages you're replying to.
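The check described here can be sketched as follows. This is a guess at the general logic, not OE's actual code; both function names are invented, and a real client would ask the user rather than silently picking the fallback.

```python
from typing import List


def unrepresentable(text: str, charset: str) -> List[str]:
    """Return the distinct characters of text that charset cannot encode."""
    missing: List[str] = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def choose_reply_charset(reply_text: str, original_charset: str,
                         fallback: str = "utf-8") -> str:
    """Keep the original charset unless the reply would lose characters."""
    if unrepresentable(reply_text, original_charset):
        return fallback
    return original_charset


print(choose_reply_charset("กลัปมาอยู่แล้ว", "cp874"))
print(choose_reply_charset("大家好", "cp874"))
```

The list returned by unrepresentable is what a client could show the user when explaining which characters would otherwise be lost.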

  From: Ayers, Mike [mailto:[EMAIL PROTECTED]]
 
  Let's try this again...
 
From: Mark Davis [mailto:[EMAIL PROTECTED]]
   
Yes, that works fine. The Thai comes through clearly:
  กลัปมาอยู่แล้ว
   
 
  Woohoo!!!  UTF-8 party!!!  ???!!!

  No, it should have been a Windows-874 party !! :-).
Both Mark Davis and Peter Constable sent messages in Windows-874
believing that they were using UTF-8.

   However, I'm sending this in UTF-8 (after automatic conversion by
my mail client, Pine 4.33).


Jungshik Shin





Re: Is there Unicode mail out there?

2001-07-11 Thread DougEwell2

In a message dated 2001-07-11 13:31:54 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  OK, I didn't look closely at the header, just at the result. Here's another
  test that will be telling - I don't know of any codepage / charset for
  Ethiopic: ሀሁሂሃ

Everything came out fine.  Of course, what I saw was the raw bytes, 
interpreted as CP1252, but I just cut and pasted them into SC UniPad and 
everything came out fine (except for the fact that UniPad doesn't have 
Ethiopic glyphs yet...).

The header revealed the encoding Peter used:

  Content-type: text/plain; charset=UTF-8

-Doug Ewell
 Fullerton, California




RE: Is there Unicode mail out there?

2001-07-11 Thread Addison Phillips [wM]

I think you'll find that Peter's response applies to you too: the mailer is seeing 
Windows-874 on the incoming message and converting your outgoing message to use that 
same encoding (in a bid to be compatible with the original message). Outlook has done 
that for awhile. If you manually set the encoding for the reply you can override that 
behavior. In Outlook 2000 this is Format | Encoding

Best Regards,

Addison

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED]]On Behalf Of Ayers, Mike
 Sent: Wednesday, July 11, 2001 12:42 PM
 To: Unicode List
 Subject: RE: Is there Unicode mail out there?
 
 
 
   Okay, I sent these as UTF-8, with some Chinese where 
 the question
 marks are.  However, the Chinese is getting eaten somewhere 
 along the way.
 Oddly, though, the Thai still displays fine.  Would any 
 Outlook XP guru
 volunteer to help me get back to my international ways?
 
   Final test:  
 
 
  From: Ayers, Mike [mailto:[EMAIL PROTECTED]] 
  
  Let's try this again...
  
From: Mark Davis [mailto:[EMAIL PROTECTED]] 

Yes, that works fine. The Thai comes through clearly: 
  กลัปมาอยู่แล้ว

  
  Woohoo!!!  UTF-8 party!!!  ???!!!
  
   
   /|/|ike
   
  
 
 





Re: Is there Unicode mail out there?

2001-07-11 Thread Peter_Constable


  Now the question is whether it's possible to force Lotus Notes
to use UTF-8 as  the encoding of the outgoing message  EVEN WHEN
characters in the message are all covered by   existing
encoding other than UTF-8 (e.g. Windows-874 for Thai).

Well, I'm going to try one more thing -- Thai test, take 2: กลัปมาอยู่แล้ว


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]




Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin




On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote:

  Unicode (UTF-8). Just as a test, here's a bit of Thai: กลัปมาอยู่แล้ว
   Your mail has the following header, which indicates that
 it's in 'Windows-874' encoding. I'm not sure whether that encoding name
 is registered with IANA for use in MIME.
 
  X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
  MIME-Version: 1.0
  Content-type: text/plain; charset=Windows-874

 OK, I didn't look closely at the header, just at the result. Here's another
 test that will be telling - I don't know of any codepage / charset for
 Ethiopic: ሀሁሂሃ

   Yes, this time you made it :-)

 X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
 MIME-Version: 1.0
 Content-type: text/plain; charset=UTF-8


  Now the question is whether it's possible to force Lotus Notes
to use UTF-8 as  the encoding of the outgoing message  EVEN WHEN
characters in the message are all covered by   existing
encoding other than UTF-8 (e.g. Windows-874 for Thai).

 One exception to this should be US-ASCII: not only is the repertoire
of US-ASCII a subset of the repertoire of UTF-8, but the
representation of every US-ASCII character is also identical in UTF-8.
A smart mail client would notice that all characters in a message
are in the US-ASCII repertoire and label the outgoing message as
US-ASCII EVEN if it's configured to label outgoing messages
as UTF-8 (or as any other superset of US-ASCII such as EUC-KR, ISO-2022-JP,
GB2312-80, or ISO8859-[1-9,15]; a better term than GB2312-80 is certainly
EUC-CN, but it's not registered with IANA and GB2312-80 has spread too
widely to remedy).  There's no violation of standards
in NOT doing this, but doing it would surely reduce
the chance of an unnecessary 'red flag' being raised by some mail clients on
the recipient's side. Unfortunately, MS OE and Netscape-Mail
are not smart in this regard, while Pine and Mutt are.


  Jungshik Shin


P.S. How about making a sort of resolution to recommend that anybody
writing to this list should use UTF-8 *if/when* possible?
This was suggested in the past, but we're still getting
a lot of messages in ISO-8859-1 and other encodings.





RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike


 From: Jungshik Shin [mailto:[EMAIL PROTECTED]] 

   Nothing cryptic. As with others on this thread, your problem is
 to mistake Windows-874 (legacy encoding for Thai) for UTF-8. Because
 Windows-874 does NOT cover Chinese characters, they turned into
 '?'. Judging from your message header, you're not using MS OE
 but something different.

I am using OE, set to UTF-8.  If I mail Chinese to myself, all is
well.

  X-Mailer: Internet Mail Service (5.5.2653.19)
  Content-Type: text/plain; charset=windows-874
  Content-Transfer-Encoding: 8bit

Odd.  Perhaps our post office is changing things.

   No, it should have been Windows-874 party !! :-).
 Both Mark Davis and Peter Constable sent   messages in Windows-874
 believing that they're using UTF-8.

Perhaps, like me, they sent messages in UTF-8 and had them converted
to Windows-874 without consent.  :-(

However, I'm sending this in UTF-8 (after automatic conversion by
 my mail client, Pine 4.33).

I also received it as UTF-8.

Addison
I think you'll find that Peter's response applies to you too: the mailer is
seeing Windows-874 on the incoming message and converting your outgoing
message to use that same encoding (in a bid to be compatible with the
original message). Outlook has done that for awhile. If you manually set the
encoding for the reply you can override that behavior. In Outlook 2000 this
is Format | Encoding
/Addison

Mine already says UTF-8.  Test again: 你好吗?


/|/|ike




Re: Is there Unicode mail out there?

2001-07-11 Thread DougEwell2

In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  P.S.How about making a sort of resolution to recommend that anybody
  writing to this list  should use UTF-8   *if /when* possible?
  This was suggested in the past, but we're still getting
  a lot of messages in ISO-8859-1 and other encodings.

Believe me, I would if I could.

-Doug Ewell
 Fullerton, California




Re: Is there Unicode mail out there?

2001-07-11 Thread James Kass

Mike Ayers wrote:



 Okay, I sent these as UTF-8, with some Chinese 
 where the question marks are.  However, the 
 Chinese is getting eaten somewhere along the way.
 Oddly, though, the Thai still displays fine.  Would 
 any Outlook XP guru volunteer to help me get back 
 to my international ways?

 Final test:  

On Outlook Express 5
[Tools] - [Options] - [Read] - [Fonts] -
(Unicode) - {Select appropriate fonts} - {Set as Default}

- then -
[Tools] - [Options] - [Read] - [International Settings] -
{Check the box marked 'Use default encoding for all...'}

This seems to work-around the distressing practice of the
program automatically replying to senders in the sender's 
default rather than the user's preference.

Possibly there are other settings under the [Send] and/or
[Compose] tabs that might also have to be adjusted.  On this
system, the 'reply to senders using the senders format' field
was unchecked, yet my replies to earlier message in the thread
were being sent as Thai (Windows).

Best regards,

James Kass.






Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote:

   Now the question is whether it's possible to force Lotus Notes
 to use UTF-8 as  the encoding of the outgoing message  EVEN WHEN
 characters in the message are all covered by   existing
 encoding other than UTF-8 (e.g. Windows-874 for Thai).

 Well, I'm going to try one more thing -- Thai test, take 2: 
กลัปมาอยู่แล้ว

  Hmm, it didn't work either. Even though you're replying to my
message in UTF-8 (clearly labeled as such in the Content-Type header),
which contained some Ethiopic characters (removed in your reply), Lotus
Notes silently (without your consent) fell back to Windows-874 when you
added some Thai characters to your reply. You may have to do some more
digging to find an option/switch buried deep inside to make Lotus Notes
use UTF-8 no matter what (or whenever you want), instead of using
the 'smallest??' encoding that covers all characters in your
outgoing messages. It seems like Lotus Notes is too 'smart'.

   Jungshik Shin





Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin




On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote:

 In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
 [EMAIL PROTECTED] writes:

   P.S.How about making a sort of resolution to recommend that anybody
   writing to this list  should use UTF-8   *if /when* possible?
   This was suggested in the past, but we're still getting
   a lot of messages in ISO-8859-1 and other encodings.

  Just in case, I didn't mean to suggest a 'resolution' to force
everyone to use UTF-8. I just wanted to suggest that a gentle and friendly
recommendation be made as to the encoding to use for this list.


 Believe me, I would if I could.

  Apparently, you're using CompuServe. I'm not sure if it's possible
to use a mail client other than one included in CompuServe 'client/browser/
whatever'.

 MIME-Version: 1.0
 Content-Type: text/plain; charset=US-ASCII
 Content-Transfer-Encoding: 7bit
 X-Mailer: CompuServe 2000 32-bit sub 113

  If what I heard is correct, it's possible to use an external mail (IMAP4
or POP3) client like Netscape 6/Mozilla and MS OE to access mail folders
in CompuServe. I also heard that unlike AOL (although CompuServe and
AOL are now affiliated) CompuServe has SMTP servers for subscribers to
use for outgoing messages. If all I said is true, I'm wondering why you
don't switch to one of 'external' mail clients I mentioned to compose
your message in UTF-8. Perhaps, what I heard is not the case and that's
why you can't do it. There is still an option, though, namely switching
your ISP :-) (perhaps, that's not a viable option for some reason)

   Jungshik Shin





RE: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin




On Wed, 11 Jul 2001, Ayers, Mike wrote:

   One last time:

   From: Mark Davis [mailto:[EMAIL PROTECTED]]
  
   Yes, that works fine. The Thai comes through clearly: 
กลัปมาอยู่แล้ว
  

   Woohoo!!! UTF-8 party!!!  大家好!!!

  Congratulations ^-^ ! This time you clearly made it, with both Thai
and Chinese characters intact in UTF-8. That's because either you manually
changed the encoding to UTF-8 in the composition window (although you were
replying to a message in Windows-874) or you were replying to a message
encoded in UTF-8.

   Jungshik Shin





Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin




On Thu, 12 Jul 2001 [EMAIL PROTECTED] wrote:

 In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
 [EMAIL PROTECTED] writes:

   One exception to this should be US-ASCII because not only the repertoire
   of US-ASCII is a subset of the repertoire of UTF-8 but also the
   representation of all characters in US-ASCII is identical in UTF-8.
   A smart mail client would notice that all characters
   are in US-ASCII repertoire  and label outgoing messages as in
   US-ASCII EVEN if it's configured to label outgoing messages
   in UTF-8

 I thought this might even be enshrined in an RFC.  It certainly makes sense.
 If you are using a mailer that sends CP1252 down the wire (not that this is a
 good idea, but some mailers do this), the mailer should examine the message
 and if it only contains US-ASCII characters, the message should be tagged as
 US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as
 ISO 8859-1.  Only if it actually contains CP1252 characters, like smart
 quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed,
 the same goes for UTF-8.

  I can't say it better than you did ! While focusing on
UTF-8, I forgot to mention the case involving Windows-125x, ISO-8859-x
and US-ASCII.

  BTW, some broken/MIME-ignorant mail clients (e.g. Eudora for MS-Windows)
do sorta the opposite. They mislabel outgoing messages as in ISO 8859-1
while they include characters like smart quotes and long dashes. The
best would be to warn users that their messages contain those characters
outside their preferred encoding and to offer a couple of options to
choose from (use Unicode or other wider encodings or 'transliterate'
those characters with those in the repertoire of user's preferred
encoding). Short of that, at least it should label it correctly (not
that I'm in favor of sending out Windows-1252 down the wire.)
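The labelling rule discussed above (tag as US-ASCII if possible, else ISO 8859-1, else Windows-1252, else UTF-8) can be sketched in Python roughly like this. The function name and the candidate order are assumptions made for illustration; the charset labels happen to be names Python's codec registry accepts.

```python
from typing import Sequence


def narrowest_charset(
    text: str,
    candidates: Sequence[str] = ("us-ascii", "iso-8859-1", "windows-1252", "utf-8"),
) -> str:
    """Return the first candidate charset that can represent every character."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue  # some character is outside this repertoire; try the next
        return charset
    raise ValueError("no candidate charset can represent the text")


for sample in ("plain text", "café", "\u201csmart quotes\u201d", "你好吗"):
    print(repr(sample), narrowest_charset(sample))
```

A client following this rule would never label smart quotes as ISO 8859-1, which is the mislabelling complained about above.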

   Jungshik Shin





Re: Is there Unicode mail out there?

2001-07-08 Thread James Kass

てんどうりゅうじ asked:

 Is there Unicode mail out there?

 ... Where is there a Unicode mail? I think there is
 at least one out there. I hope it will not cost me money; 
 that is for sake. That is, the U+9152 sake.
 

Microsoft's Outlook Express offers many e-mail encoding
options, including Unicode (UTF-8) and responding to the
sender in the same encoding as the sender's message.  And,
it won't cost you money.

Best regards,

James Kass.


Re: Is there Unicode mail out there?

2001-07-08 Thread James

James Kass wrote:
 
 てんどうりゅうじ asked:
 
  Is there Unicode mail out there?
 
  ... Where is there a Unicode mail? I think there is
  at least one out there. I hope it will not cost me money;
  that is for sake. That is, the U+9152 sake.

Email has proven to be one of the protocols slowest
to adopt Unicode. Everybody is still generally
using legacy encodings, especially with list servers.
Sending Unicode to individuals with email clients that you know
can support it is ok.
 
 Microsoft's Outlook Express offers many e-mail encoding
 options, including Unicode (UTF-8) and responding to the
 sender in the same encoding as the sender's message.  And,
 it won't cost you money.

Just your soul ...