Re: Is there Unicode mail out there?

2001-07-23 Thread James Kass

Mark Davis wrote:

> The quotation I have is from my college Greek textbook (sadly my fluency has
> reduced to essentially zero after all these years).
>
> Perhaps some Greeks on the list could say which is the more accurate
> formulation?
>
> Mark
> —
>
> πάντων μέτρον ἄνθρωπος — Πρωταγόρας
> [http://www.macchiato.com]
>

In "Dictionary of Foreign Phrases and Abbreviations" (Guinagh, 1965),
the following appears:

" Panton metron anthropos estin.  Gk--Man is the measure of
  all things.  Quoted by Plato, Theaetetus, 178b. "

Best regards,

James Kass.

Re: Is there Unicode mail out there?

2001-07-22 Thread Martin Duerst

Sorry - by 'pattern restrictions on mixed content' I meant a
feature in XML Schema that would allow one to specify that the
mixed content in certain elements is restricted by a pattern
facet. This feature isn't in XML Schema, but it has been
discussed. It would make it possible to declare that a document
does not allow C0 control characters, which would be very
important in many cases if the basic XML syntax were to start
allowing C0.

Regards,   Martin.

At 10:32 01/07/19 -0600, Shigemichi Yazawa wrote:
>At Thu, 19 Jul 2001 15:52:39 +0900,
>Martin Duerst <[EMAIL PROTECTED]> wrote:
> > Of course then pattern restrictions on mixed content (which we
> > currently don't have) would become really helpful.
>
>Martin,
>
>What kind of pattern restrictions are necessary by introducing C0 NCR?
>Something like this?
>
>---
>Shigemichi Yazawa
>[EMAIL PROTECTED]

Re: Is there Unicode mail out there?

2001-07-20 Thread John Cowan


Ayers, Mike scripsit:

>   Simple.  Since "]]>" is used to mark the end of a CDATA section, and
> since CDATA can contain anything, if you want to put the sequence "]]>"
> INSIDE your CDATA, then you must escape the ">", or else it will END your
> CDATA.

That isn't what it says, and isn't true to boot.  There is no way at
all to put "]]>" into a CDATA section.

If you change the ">" to "&gt;", it will not terminate the CDATA section,
true -- but because entity references aren't recognized in CDATA sections,
your application will wind up with the literal characters '&', 'g', 't', ';'.

About all you can do is terminate the CDATA section, insert the ">", and
restart the CDATA section, thus:

[example markup lost in the archive]

The first "]]>" terminates
it, the next ">" is outside, and the "<![CDATA[" restarts the section.
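
The workaround described above (end the CDATA section, emit the ">" outside it, then reopen) can be sketched in Python. This is only an illustration: `cdata_wrap` is a hypothetical helper name, not something from the thread, and the round trip uses the standard-library ElementTree parser.

```python
import xml.etree.ElementTree as ET

def cdata_wrap(text):
    """Wrap text in CDATA sections, splitting wherever "]]>" occurs."""
    # Each "]]>" in the data becomes: "]]" left inside the section, the
    # section's own closing "]]>", a bare ">" outside, and a new opener.
    return "<![CDATA[" + text.replace("]]>", "]]]]>><![CDATA[") + "]]>"

wrapped = cdata_wrap("a]]>b")
# Parse it back to confirm that the original string survives the round trip.
round_trip = ET.fromstring("<r>" + wrapped + "</r>").text
```

The parser sees two CDATA sections with a single ">" between them and reports all three pieces as one run of character data.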

Re: Is there Unicode mail out there?

2001-07-20 Thread Mark Davis

The quotation I have is from my college Greek textbook (sadly my fluency has
reduced to essentially zero after all these years).

Perhaps some Greeks on the list could say which is the more accurate
formulation?

Mark
—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: "Otto Stolz" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: "unicode" <[EMAIL PROTECTED]>
Sent: Friday, July 20, 2001 09:18
Subject: Re: Is there Unicode mail out there?

Mark Davis wrote:

>
>πάντων μέτρον ἄνθρωπος — Πρωταγόρας
>
You mean “πάντων χρημάτων μέτρον ἄνθρωπος”, don't you? ;-)

Best wishes,
  Otto Stolz

RE: Is there Unicode mail out there?

2001-07-20 Thread Ayers, Mike



> From: Tex Texin [mailto:[EMAIL PROTECTED]] 

> So it must not be an NCR, EXCEPT in the seemingly rare case where
> the string "]]>" appears in content AND that string is not being
> used to indicate the end of a CDATA section.
> 
> How is that supposed to be read?

Simple.  Since "]]>" is used to mark the end of a CDATA section, and
since CDATA can contain anything, if you want to put the sequence "]]>"
INSIDE your CDATA, then you must escape the ">", or else it will END your
CDATA.

In other words, CDATA can contain anything except literal "]]>".

Think "*/" and C/C++...

HTH,


/|/|ike

Re: Is there Unicode mail out there?

2001-07-20 Thread Tex Texin


John,
ok and thanks. I wasn't looking at the "may" though, I was looking
at the "must".

Maybe I am not parsing this sentence right. To me it says:

(must, for compatibility, be escaped using "&gt;" )

or

(a character reference when it appears in the string "]]>" in
content, when that string is not marking the end of a CDATA
section.)

So it must not be an NCR, EXCEPT in the seemingly rare case where
the string "]]>" appears in content AND that string is not being
used to indicate the end of a CDATA section.

How is that supposed to be read?

tex

John Cowan wrote:
> 
> Tex Texin scripsit:
> 
> > Which seemed to me to rule out the NCR for > in situations other
> > than "]]>" for compatibility reasons.
> >
> > "If they are needed elsewhere, they must be escaped using either
> > numeric character references or the strings "&amp;" and "&lt;"
> > respectively. The right angle bracket (>) may be represented using
> > the string "&gt;", and must, for compatibility, be escaped using
> > "&gt;" or a character reference when it appears in the string "]]>"
> > in content, when that string is not marking the end of a CDATA
> > section."
> 
> Naah.  Just because it says "may" doesn't mean anything: what "may" be
> done, also "may" be not done.  You may use a numeric character
> reference for any legal character.
> 
> --
> John Cowan   [EMAIL PROTECTED]
> One art/there is/no less/no more/All things/to do/with sparks/galore
> --Douglas Hofstadter

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---

Re: Is there Unicode mail out there?

2001-07-20 Thread John Cowan


Tex Texin scripsit:

> Which seemed to me to rule out the NCR for > in situations other
> than "]]>" for compatibility reasons.
> 
> "If they are needed elsewhere, they must be escaped using either
> numeric character references or the strings "&amp;" and "&lt;"
> respectively. The right angle bracket (>) may be represented using
> the string "&gt;", and must, for compatibility, be escaped using
> "&gt;" or a character reference when it appears in the string "]]>"
> in content, when that string is not marking the end of a CDATA
> section."

Naah.  Just because it says "may" doesn't mean anything: what "may" be
done, also "may" be not done.  You may use a numeric character
reference for any legal character.
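
A quick check of this point, as a sketch using Python's standard-library ElementTree parser (the sample document is made up for illustration): the right angle bracket comes out the same whether it is written literally, as the predefined entity, or as a decimal or hexadecimal character reference.

```python
import xml.etree.ElementTree as ET

# The same character written four ways: literal, predefined entity,
# decimal character reference, hexadecimal character reference.
doc = "<p>a > b, a &gt; b, a &#62; b, a &#x3E; b</p>"
text = ET.fromstring(doc).text
```

After parsing, all four forms are the same ">" character in `text`.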

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter

RE: Is there Unicode mail out there?

2001-07-20 Thread Shigemichi Yazawa


At Thu, 19 Jul 2001 13:11:35 -0500,
Ayers, Mike <[EMAIL PROTECTED]> wrote:
>   I'm proposing it as a convention, not a proprietary solution.  I
> agree that a standard solution would be preferred, especially Martin's
> suggestion of permitting the escape codes but not the characters.  I
> proposed the markup as a workaround until a better solution could be found.

This sounds good. Can we submit a proposition to W3C? I believe that
it helps many people.

-
Shigemichi Yazawa
[EMAIL PROTECTED]

RE: Is there Unicode mail out there?

2001-07-20 Thread Bill Kurmey


At 01:11 PM 7/19/01 -0500, Mike Ayers wrote:

>   The work has to be done somewhere.  Emerging technologies must be
>compatible with existing ones, and some old technologies hang around a long
>time.  Really, the disallowing of control characters makes sense, since
>their interpretation in so many existing protocols is "wreak havoc upon the
>unsuspecting".  You simply can't send these characters around the internet
>and expect them to arrive unchanged.

Does anyone have a (list, web site, reference) which lists which C0 and C1
control codes "wreak havoc upon the unsuspecting" and why?  



Bill Kurmey, Edmonton, AB, Canada

Re: Is there Unicode mail out there?

2001-07-19 Thread Tex Texin

Lars,
I was looking at Section 2.4 Character Data and Markup:
http://www.w3.org/TR/2000/REC-xml-20001006#syntax

Which seemed to me to rule out the NCR for > in situations other
than "]]>" for compatibility reasons.

"If they are needed elsewhere, they must be escaped using either
numeric character references or the strings "&amp;" and "&lt;"
respectively. The right angle bracket (>) may be represented using
the string "&gt;", and must, for compatibility, be escaped using
"&gt;" or a character reference when it appears in the string "]]>"
in content, when that string is not marking the end of a CDATA
section."

tex

Lars Marius Garshol wrote:
> 
> * Tex Texin
> |
> | XML restricts the character set which by implication restricts the
> | NCR values. I see that > can't use an NCR but < can.
> 
> They can both use NCRs. In fact, the example definitions of the
> predefined entities do just that:
> 
>   http://www.w3.org/TR/REC-xml#sec-predefined-ent
> 
> --Lars M.

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---

RE: Is there Unicode mail out there?

2001-07-19 Thread Ayers, Mike



> From: Shigemichi Yazawa [mailto:[EMAIL PROTECTED]] 

> XML states "Its goal is to enable generic SGML to be served, received,
> and processed on the Web in the way that is now possible with HTML."
> But, in my opinion, XML has outgrown its original goal way too
> far. XML seems to be used in every aspect of software engineering
> these days.

True, but don't blame W3C for the digital hammer effect.

> Tagging disallowed characters is one way to work around the
> problem. But I don't buy this solution for two reasons.
> 
> 1. Markup is for describing a document's structure. 1 Introduction
>says "Markup encodes a description of the document's storage layout
>and logical structure."

That's how it works in theory.  In practice, however, pictures,
applets, and many other non-structural components are encoded with markup.

> 2. This is a proprietary solution. To get the original character, the
>application needs to know the semantics of the markup and needs to
>know how to decode the data appropriately. If it's a standard
>encoding like NCR, that's fine because everybody knows how to deal
>with it. But the tagging is specific to a DTD. It makes it difficult
>to interchange the data.

I'm proposing it as a convention, not a proprietary solution.  I
agree that a standard solution would be preferred, especially Martin's
suggestion of permitting the escape codes but not the characters.  I
proposed the markup as a workaround until a better solution could be found.

> This character restriction in XML makes XML document creation
> difficult. 

The work has to be done somewhere.  Emerging technologies must be
compatible with existing ones, and some old technologies hang around a long
time.  Really, the disallowing of control characters makes sense, since
their interpretation in so many existing protocols is "wreak havoc upon the
unsuspecting".  You simply can't send these characters around the internet
and expect them to arrive unchanged.


/|/|ike

RE: Is there Unicode mail out there?

2001-07-19 Thread Ayers, Mike



> From: John Cowan [mailto:[EMAIL PROTECTED]] 

> I think that any proposal to shrink the range of well-formed documents
> is simply a nonstarter, regrettable as that is.

I had thought that one of the main goals of XML Blueberry was
mainframe compatibility.  If so, won't they need to disallow the C1
characters which wreak havoc on mainframe terminals?  If they can make that
change, other relatively minor changes could be made at that time (if ever).
That's my thinking, anyway.

Should I be crossposting the XML folks on this?


/|/|ike

Re: Is there Unicode mail out there?

2001-07-19 Thread Andy Heninger


I agree with the overall sentiment here, but here's one nit

> Or you are so lazy that
> you want to put it [your data] in CDATA section without checking it at all.

CDATA sections have a severe problem, which is that there is no
way to escape otherwise legal XML characters that can't be
represented in the chosen document encoding.

The best bet is to avoid CDATA sections altogether.
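
To illustrate the problem, here is a small Python sketch (standard library only; the element name `a` and the sample character U+00E9 are arbitrary choices). A character that the document encoding cannot represent has to be written as a character reference, and that reference is only interpreted outside a CDATA section.

```python
import xml.etree.ElementTree as ET

# U+00E9 has no representation in US-ASCII, so an ASCII-encoded document
# must write it as a numeric character reference.
ref = "\u00e9".encode("ascii", "xmlcharrefreplace").decode("ascii")

# In ordinary content the reference is resolved back to the character...
in_content = ET.fromstring("<a>" + ref + "</a>").text
# ...but inside CDATA it stays as the literal characters of the reference.
in_cdata = ET.fromstring("<a><![CDATA[" + ref + "]]></a>").text
```

Here `in_content` is the single character U+00E9, while `in_cdata` is the six-character string "&#233;".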

Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]


- Original Message -
From: "Shigemichi Yazawa" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, July 19, 2001 12:03 AM
Subject: RE: Is there Unicode mail out there?


> At Wed, 18 Jul 2001 14:21:35 -0500,
> Ayers, Mike <[EMAIL PROTECTED]> wrote:
> > So why not use tagged data to represent C0 and C1 characters?  That
> > is what XML is made of.  As far as why control characters are not permitted,
> > it seems to me that this is so that XML documents can be passed around
> > easily, through HTTP, email, FTP and so on, without loss of data.  Protocols
> > abound which interpret control characters, so XML files which contain data
> > may get mangled or may mangle the systems which pass them.  However, if that
> > data is included as tagged hex digits, no problem will occur either way.
>
> XML states "Its goal is to enable generic SGML to be served, received,
> and processed on the Web in the way that is now possible with HTML."
> But, in my opinion, XML has outgrown its original goal way too
> far. XML seems to be used in every aspect of software engineering
> these days.
>
> Tagging disallowed characters is one way to work around the
> problem. But I don't buy this solution for two reasons.
>
> 1. Markup is for describing a document's structure. 1 Introduction
>says "Markup encodes a description of the document's storage layout
>and logical structure." You could do something like
><… codepoint="000c" />. This doesn't express any structure of the
>document, though. Using a markup merely to escape a character is
>too hacky, in my opinion.
>
> 2. This is a proprietary solution. To get the original character, the
>application needs to know the semantics of the markup and needs to
>know how to decode the data appropriately. If it's a standard
>encoding like NCR, that's fine because everybody knows how to deal
>with it. But the tagging is specific to a DTD. It makes it difficult
>to interchange the data.
>
> This character restriction in XML makes XML document creation
> difficult. Say you have some data you want to wrap in XML. You don't
> know much about the content of the data. What you know about it is its
> character encoding and that it is textual data. That's fine because
> you just want to wrap it in XML. You would check if it contains "<"
> or "&" and convert them to entity references. Or you are so lazy that
> you want to put it in a CDATA section without checking it at all. The
> problem is that it might contain C0 control codes, which are legal
> characters in most encodings. Unless you are absolutely sure
> that the data doesn't contain any control codes, you have to check
> every character to make sure that you don't produce an ill-formed XML
> document. Even if you find a control, there isn't a standard way to
> treat it. You end up deleting it or escaping it in a proprietary way.
>
> -
> Shigemichi Yazawa
> [EMAIL PROTECTED]
>
>

Re: Is there Unicode mail out there?

2001-07-19 Thread Shigemichi Yazawa


At Thu, 19 Jul 2001 15:52:39 +0900,
Martin Duerst <[EMAIL PROTECTED]> wrote:
> Of course then pattern restrictions on mixed content (which we
> currently don't have) would become really helpful.

Martin,

What kind of pattern restrictions are necessary by introducing C0 NCR?
Something like this?

---
Shigemichi Yazawa
[EMAIL PROTECTED]

Re: Is there Unicode mail out there?

2001-07-19 Thread Mark Davis


I agree.

Mark
—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: "Martin Duerst" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "John Cowan"
<[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; "Lars Marius Garshol" <[EMAIL PROTECTED]>
Sent: Wednesday, July 18, 2001 23:52
Subject: Re: Is there Unicode mail out there?


> I think that the right solution, if we could redo things, would
> be to allow something like  in content, but to never use
> the actual byte values. This would allow the data guys to stream
> stuff, and could leave the document guys reasonably unconcerned.
> Of course then pattern restrictions on mixed content (which we
> currently don't have) would become really helpful.
>
> Regards,   Martin.
>
> At 08:07 01/07/18 -0700, Mark Davis wrote:
> > > I wouldn't want any control codes in a database. Having a control-G
> > > may be funny (the joke as I know it goes back to Don Knuth), but
> > > something like a control-S is too much of a risk.
> >
> >*You* wouldn't want?
> >
> >There are a lot of characters *I* wish were not in databases, or in use at
> >all. A lot of them may or may not make sense. Whether or not I want them,
> >someone can have a database where they are allowed. By having this
> >(inconsistent) restriction, it simply means I can't be guaranteed full
> >round-tripping  from databases to XML and back, no matter what their
> >content.
> >
> >Of course, this is not a huge restriction -- it is simply a gratuitous
> >annoyance. One could even live with something much more onerous, say XML
> >disallowing all characters whose code points were divisible by 4321 -- just
> >have complicated DTDs and shift into base64 if you encounter any of those
> >codes.
>
>

Re: Is there Unicode mail out there?

2001-07-19 Thread Martin Duerst


I think that the right solution, if we could redo things, would
be to allow something like  in content, but to never use
the actual byte values. This would allow the data guys to stream
stuff, and could leave the document guys reasonably unconcerned.
Of course then pattern restrictions on mixed content (which we
currently don't have) would become really helpful.

Regards,   Martin.

At 08:07 01/07/18 -0700, Mark Davis wrote:
> > I wouldn't want any control codes in a database. Having a control-G
> > may be funny (the joke as I know it goes back to Don Knuth), but
> > something like a control-S is too much of a risk.
>
>*You* wouldn't want?
>
>There are a lot of characters *I* wish were not in databases, or in use at
>all. A lot of them may or may not make sense. Whether or not I want them,
>someone can have a database where they are allowed. By having this
>(inconsistent) restriction, it simply means I can't be guaranteed full
>round-tripping  from databases to XML and back, no matter what their
>content.
>
>Of course, this is not a huge restriction -- it is simply a gratuitous
>annoyance. One could even live with something much more onerous, say XML
>disallowing all characters whose code points were divisible by 4321 -- just
>have complicated DTDs and shift into base64 if you encounter any of those
>codes.

RE: Is there Unicode mail out there?

2001-07-19 Thread Shigemichi Yazawa

At Wed, 18 Jul 2001 14:21:35 -0500,
Ayers, Mike <[EMAIL PROTECTED]> wrote:
>   So why not use tagged data to represent C0 and C1 characters?  That
> is what XML is made of.  As far as why control characters are not permitted,
> it seems to me that this is so that XML documents can be passed around
> easily, through HTTP, email, FTP and so on, without loss of data.  Protocols
> abound which interpret control characters, so XML files which contain data
> may get mangled or may mangle the systems which pass them.  However, if that
> data is included as tagged hex digits, no problem will occur either way.

XML states "Its goal is to enable generic SGML to be served, received,
and processed on the Web in the way that is now possible with HTML."
But, in my opinion, XML has outgrown its original goal way too
far. XML seems to be used in every aspect of software engineering
these days.

Tagging disallowed characters is one way to work around the
problem. But I don't buy this solution for two reasons.

1. Markup is for describing a document's structure. 1 Introduction
   says "Markup encodes a description of the document's storage layout
   and logical structure." You could do something like
   <… codepoint="000c" />. This doesn't express any structure of the
   document, though. Using a markup merely to escape a character is
   too hacky, in my opinion.

2. This is a proprietary solution. To get the original character, the
   application needs to know the semantics of the markup and needs to
   know how to decode the data appropriately. If it's a standard
   encoding like NCR, that's fine because everybody knows how to deal
   with it. But the tagging is specific to a DTD. It makes it difficult
   to interchange the data.

This character restriction in XML makes XML document creation
difficult. Say you have some data you want to wrap in XML. You don't
know much about the content of the data. What you know about it is its
character encoding and that it is textual data. That's fine because
you just want to wrap it in XML. You would check if it contains "<"
or "&" and convert them to entity references. Or you are so lazy that
you want to put it in a CDATA section without checking it at all. The
problem is that it might contain C0 control codes, which are legal
characters in most encodings. Unless you are absolutely sure
that the data doesn't contain any control codes, you have to check
every character to make sure that you don't produce an ill-formed XML
document. Even if you find a control, there isn't a standard way to
treat it. You end up deleting it or escaping it in a proprietary way.
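
The wrapping step described above might look like the following Python sketch. The function name `wrap` and the element name `data` are made up for illustration. The check covers only the C0 controls, and the sketch chooses to reject them rather than delete or re-encode them, since there is no standard treatment.

```python
from xml.sax.saxutils import escape

# Of the C0 controls, XML 1.0 allows only tab, line feed and carriage return.
ALLOWED_C0 = "\t\n\r"

def wrap(text, tag="data"):
    """Escape markup characters and wrap text in a (hypothetical) element."""
    for pos, ch in enumerate(text):
        if ch < " " and ch not in ALLOWED_C0:
            raise ValueError(
                "U+%04X at offset %d is not allowed in XML 1.0" % (ord(ch), pos)
            )
    # escape() replaces "&", "<" and ">" with the predefined entities.
    return "<%s>%s</%s>" % (tag, escape(text), tag)

wrapped = wrap("a < b & c")
```

Calling `wrap("page\x0cbreak")` raises `ValueError`, which is the point where an application has to pick its own non-standard policy.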

-
Shigemichi Yazawa
[EMAIL PROTECTED]

Re: Is there Unicode mail out there?

2001-07-18 Thread Mark Davis


It would be fine if you could escape them; after all, nobody does expect the
raw codes to work. However, you can't use the normal escapes to encode them,
e.g. 
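
As a rough check of this behaviour, the Python sketch below feeds a few tiny documents to the standard-library ElementTree parser. The helper name `parses` is made up, and the choice of U+000C (the form feed mentioned elsewhere in this thread) as the test character is an assumption, not the example from the original message.

```python
import xml.etree.ElementTree as ET

def parses(doc):
    """Return True if the parser accepts doc as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

tab_ref = parses("<a>&#9;</a>")        # tab is in the Char production
formfeed_ref = parses("<a>&#12;</a>")  # form feed is not, even as a reference
formfeed_raw = parses("<a>\x0c</a>")   # nor as a raw character
```

Only the first document is accepted. The form feed is rejected both as a raw character and as a numeric character reference.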

Mark
—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: "Ayers, Mike" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, July 18, 2001 12:21
Subject: RE: Is there Unicode mail out there?


>
> > From: Shigemichi Yazawa [mailto:[EMAIL PROTECTED]]
>
> > > From: "Bill Kurmey" <[EMAIL PROTECTED]>
>
> > > > My concern stems from working with an email archive format which uses soh,
> > > > stx and etx as an envelope.
> >
> > Good point. U+000c is also used frequently in email's and news
> > article's body. It may not make sense to allow control characters in
> > HTML, but it does make sense in XML when it is used as a container of
> > data including legacy data like email archives.
>
> So why not use tagged data to represent C0 and C1 characters?  That
> is what XML is made of.  As far as why control characters are not permitted,
> it seems to me that this is so that XML documents can be passed around
> easily, through HTTP, email, FTP and so on, without loss of data.  Protocols
> abound which interpret control characters, so XML files which contain data
> may get mangled or may mangle the systems which pass them.  However, if that
> data is included as tagged hex digits, no problem will occur either way.
>
> The UTC folks may wish to consider contacting the XML Blueberry
> folks about getting the XML character repertoire cleaned up in that project.
>
>
> /|/|ike
>
>

RE: Is there Unicode mail out there?

2001-07-18 Thread Ayers, Mike



> From: Shigemichi Yazawa [mailto:[EMAIL PROTECTED]] 

> > From: "Bill Kurmey" <[EMAIL PROTECTED]>

> > > My concern stems from working with an email archive format which uses soh,
> > > stx and etx as an envelope.
> 
> Good point. U+000c is also used frequently in email's and news
> article's body. It may not make sense to allow control characters in
> HTML, but it does make sense in XML when it is used as a container of
> data including legacy data like email archives.

So why not use tagged data to represent C0 and C1 characters?  That
is what XML is made of.  As far as why control characters are not permitted,
it seems to me that this is so that XML documents can be passed around
easily, through HTTP, email, FTP and so on, without loss of data.  Protocols
abound which interpret control characters, so XML files which contain data
may get mangled or may mangle the systems which pass them.  However, if that
data is included as tagged hex digits, no problem will occur either way.

The UTC folks may wish to consider contacting the XML Blueberry
folks about getting the XML character repertoire cleaned up in that project.


/|/|ike

Re: Is there Unicode mail out there?

2001-07-18 Thread Shigemichi Yazawa


> - Original Message -
> From: "Bill Kurmey" <[EMAIL PROTECTED]>
> To: "Mark Davis" <[EMAIL PROTECTED]>
> Sent: Wednesday, July 18, 2001 03:08
> Subject: Is there Unicode mail out there?
> 
> > Am I missing something somewhere in the specifications on the W3C site?
> > Where is there a reference forbidding an XML processor from handling ANY
> > character that is defined in Unicode and ISO/IEC 10646?

Look at http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks,
       FFFE, and FFFF. */
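
For readers who want to test data against this production, here is a minimal Python sketch. The names `NOT_XML_CHAR` and `first_illegal` are made up for illustration. The character class is simply the complement of the ranges listed in production [2].

```python
import re

# Anything NOT matched by production [2] Char of XML 1.0 (Second Edition).
NOT_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def first_illegal(text):
    """Return (offset, 'U+XXXX') of the first non-Char character, or None."""
    m = NOT_XML_CHAR.search(text)
    if m is None:
        return None
    return m.start(), "U+%04X" % ord(m.group())
```

For example, `first_illegal("ok\x0c")` reports the form feed at offset 2, while a string containing only tabs, line breaks and ordinary text returns `None`.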

I think XML spec is self contradictory.

> > My concern stems from working with an email archive format which uses soh,
> > stx and etx as an envelope.

Good point. U+000c is also used frequently in email's and news
article's body. It may not make sense to allow control characters in
HTML, but it does make sense in XML when it is used as a container of
data including legacy data like email archives.

---
Shigemichi Yazawa
[EMAIL PROTECTED]

Re: FW: Re: Is there Unicode mail out there?

2001-07-18 Thread Youtie Effaight


>From: ÄñÇ¤èã¤¶ <[EMAIL PROTECTED]>
>What's control-S?

Jeez Leweez... Use the Net, Luke! Do some research for a change and show us 
your intelligence rather than advertise your laziness. Betcha it wouldn't 
take ten minutes to find the answer.

Yer ol' Pal,
Youtie





FW: Re: Is there Unicode mail out there?

2001-07-18 Thread







>> I wouldn't want any control codes in a database. Having a control-G
>> may be funny (the joke as I know it goes back to Don Knuth), but
>> something like a control-S is too much of a risk.

I remember from grade school. To have the computer make noise, I had it print 
control-G.
What's control-S?

Isn't control-M just line break?
An indispensable code!

Re: Is there Unicode mail out there?

2001-07-18 Thread Mark Davis


I believe that they are formally disallowed, if one traces through the right
path in the standards. Rather than do that myself, I believe that the XML
lawyers on this list can tell you the precise answer more quickly.

Mark

—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: "Bill Kurmey" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Sent: Wednesday, July 18, 2001 03:08
Subject: Is there Unicode mail out there?


> I can find no restriction on control codes in HTML 4.01.  Nor on their
> representation as NCRs in either decimal or hexadecimal form.
>
> Section 2.2 of the XML-20001006 states
>
> "Legal characters are tab, carriage return, line feed, and the legal
> characters of Unicode and ISO/IEC 10646. The versions of these standards
> cited in A.1 Normative References were current at the time this document
> was prepared. New characters may be added to these standards by amendments
> or new editions.   Consequently, XML processors must accept any character
> in the range specified for Char."
>
> I can't find any statement that indicates that an XML processor cannot
> accept a control character that is a "legal" character in Unicode and
> ISO/IEC 10646, only if an ENCODING contains an octet sequence that is,
> presumably, not legal in Unicode and ISO/IEC 10646.  I interpret 2.2 to
> mean that the XML processor MUST accept the characters specified in 2.2,
> but need not be limited to those characters.
>
> "It is a fatal error when an XML processor encounters an entity with an
> encoding that it is unable to process. It is a fatal error if an XML entity
> is determined (via default, encoding declaration, or higher-level protocol)
> to be in a certain encoding but contains octet sequences that are not legal
> in that
> encoding. It is also a fatal error if an XML entity contains no encoding
> declaration and its content is not legal UTF-8 or UTF-16."
>
> Am I missing something somewhere in the specifications on the W3C site?
> Where is there a reference forbidding an XML processor from handling ANY
> character that is defined in Unicode and ISO/IEC 10646?
>
> My concern stems from working with an email archive format which uses soh,
> stx and etx as an envelope.
>
> > Mark Davis wrote:
> >
> > > I had been told by the W3C people that the reason for forbidding control
> > > characters in XML and HTML was for compatibility with SGML.
> >
> >
> > More accurately, with the SGML default syntax, which is used in HTML
> > and (with a few modifications) in XML.
>
>
>
> Bill Kurmey, Edmonton, AB, Canada
>
>

Re: Is there Unicode mail out there?

2001-07-18 Thread Mark Davis


> I wouldn't want any control codes in a database. Having a control-G
> may be funny (the joke as I know it goes back to Don Knuth), but
> something like a control-S is too much of a risk.

*You* wouldn't want?

There are a lot of characters *I* wish were not in databases, or in use at
all. A lot of them may or may not make sense. Whether or not I want them,
someone can have a database where they are allowed. By having this
(inconsistent) restriction, it simply means I can't be guaranteed full
round-tripping  from databases to XML and back, no matter what their
content.

Of course, this is not a huge restriction -- it is simply a gratuitous
annoyance. One could even live with something much more onerous, say XML
disallowing all characters whose code points were divisible by 4321 -- just
have complicated DTDs and shift into base64 if you encounter any of those
codes.
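
The base64 fallback mentioned above could be sketched as follows in Python. All function names are hypothetical, and the check looks only at C0 controls, not at the other excluded code points. The "text" / "base64" label stands in for whatever attribute or DTD convention an application would define for itself.

```python
import base64

def needs_base64(text):
    # C0 controls other than tab, LF and CR cannot appear in XML 1.0,
    # either raw or as character references.
    return any(ch < " " and ch not in "\t\n\r" for ch in text)

def encode_field(text):
    """Return (label, payload) for a database text field."""
    if needs_base64(text):
        payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
        return "base64", payload
    return "text", text

def decode_field(label, payload):
    """Reverse encode_field()."""
    if label == "base64":
        return base64.b64decode(payload).decode("utf-8")
    return payload
```

This does give a full round trip, but only between parties that agree on the convention, which is the complaint being made here.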

Mark
—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

- Original Message -
From: "Martin Duerst" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "John Cowan"
<[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; "Lars Marius Garshol" <[EMAIL PROTECTED]>
Sent: Tuesday, July 17, 2001 18:36
Subject: Re: Is there Unicode mail out there?


> At 14:30 01/07/17 -0700, Mark Davis wrote:
> > > In that case the content of the field is not text but an octet string,
> > > and you need to do something different, like base64-ing it.
> >
> >The content in the database is not an octet string: it is a text field that
> >happens to have a control code -- a legitimate character code -- in it.
> >Practically every database allows control codes in text fields. (And why are
> >C1 controls allowed? After all, they are even less frequent than C0
> >controls.)
>
> Mark - I understand your dissatisfaction. But the C1 controls are not
> allowed in HTML4, and according to James Clark, the fact that they are
> allowed in XML was an oversight.
>
> Databases can (and should) take care of their data. There are very
> few cases where having control characters in there makes sense.
> In most cases, however, they are errors, and if XML gives an
> incentive to fix them, all the better.
>
> I wouldn't want any control codes in a database. Having a control-G
> may be funny (the joke as I know it goes back to Don Knuth), but
> something like a control-S is too much of a risk.
>
>
> Regards,   Martin.
>
>

Re: Is there Unicode mail out there?

2001-07-17 Thread Martin Duerst

At 14:30 01/07/17 -0700, Mark Davis wrote:
> > In that case the content of the field is not text but an octet string,
> > and you need to do something different, like base64-ing it.
>
>The content in the database is not an octet string: it is a text field that
>happens to have a control code -- a legitimate character code -- in it.
>Practically every database allows control codes in text fields. (And why are
>C1 controls allowed? After all, they are even less frequent than C0
>controls.)

Mark - I understand your dissatisfaction. But the C1 controls are not
allowed in HTML4, and according to James Clark, the fact that they are
allowed in XML was an oversight.

Databases can (and should) take care of their data. There are very
few cases where having control characters in there makes sense.
In most cases, however, they are errors, and if XML gives an
incentive to fix them, all the better.

I wouldn't want any control codes in a database. Having a control-G
may be funny (the joke as I know it goes back to Don Knuth), but
something like a control-S is too much of a risk.

Regards,   Martin.

Re: Is there Unicode mail out there?

2001-07-17 Thread John Cowan


[EMAIL PROTECTED] scripsit:

> I was just looking through the XML spec today, and the only non-characters
> excluded (other than the surrogates) are 0xFFFE and 0xFFFF.

Unfortunately, there's nothing we can do about it now, nor about the useless
C1 controls other than NEL.  Shrinking the range of well-formed documents
is an immediate loser, even if there is no plausible use for such
documents.

Just pretend you'll never get one of the legal non-characters.

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter

Re: Is there Unicode mail out there?

2001-07-17 Thread DougEwell2

In a message dated 2001-07-17 2:24:44 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  The document character set used by HTML
>  is Unicode, but some characters have been disallowed, and may not
>  appear in documents, whether directly or by reference. These are
>
>   U+0000 - U+0008
>   U+000B - U+000C
>   U+000E - U+001F
>   U+007F - U+009F
>   U+D800 - U+DFFF 

This list, and others like it, needs to be updated to include the
non-characters (0xFDD0 through 0xFDEF, plus all code points whose low-order
16 bits are 0xFFFE or 0xFFFF).

I was just looking through the XML spec today, and the only non-characters
excluded (other than the surrogates) are 0xFFFE and 0xFFFF.
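
The noncharacter rule described above (0xFDD0 through 0xFDEF, plus every
code point whose low-order 16 bits are 0xFFFE or 0xFFFF) can be sketched
in Python. This is only an illustration; the function name
is_noncharacter is an assumption and does not come from any spec or
library.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a Unicode noncharacter (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The contiguous block U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: low 16 bits FFFE or FFFF.
    return (cp & 0xFFFF) in (0xFFFE, 0xFFFF)


# 32 code points in the FDD0 block plus 2 in each of the 17 planes.
print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
```

Counting them this way gives 66 noncharacters, of which the XML 1.0 Char
production excludes only two.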

-Doug Ewell
 Fullerton, California

Re: Is there Unicode mail out there?

2001-07-17 Thread Mark Davis

> In that case the content of the field is not text but an octet string,
> and you need to do something different, like base64-ing it.

The content in the database is not an octet string: it is a text field that
happens to have a control code -- a legitimate character code -- in it.
Practically every database allows control codes in text fields. (And why are
C1 controls allowed? After all, they are even less frequent than C0
controls.)

Your task is to design an XML DTD to represent a selection from a database.
The database is nothing fancy: Latin-1 encoded. It is conceivable that a
control character is in one of the hundreds of thousands of records. Not
likely, but conceivable. You must guarantee no loss of data in the XML
representation of the data.

If XML could represent all control characters, then an instance of a
selection in XML might be as simple as the following.

  John
  Smith
  1950-10-10
...

The DTD would also be simple. Now, change the DTD (*and* the program that
interprets it) so that each and every text field could be a base64 instead.
Very ugly. You don't want to simply change all the fields to base64, since
that would (a) bulk them up and (b) make them unreadable for debugging. So
you end up having each field have two alternate representations. And in your
parser you have to be prepared for either, and in your generator you have to
pick between them.

Notice that for *any* database that allows control codes, to avoid data
corruption you would have to do such ugliness for any XML representation. Of
course, nobody does it, which means that there is always the opportunity for
data corruption. Of course, one might just not care -- after all, it would
be rare that this would cause a problem.
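
The two-representation scheme described above might look roughly like
the following Python sketch. Everything named here is an assumption made
for illustration: the enc="base64" attribute, the helper names, and the
choice of UTF-8 for the bytes that get base64-encoded. CR is also routed
through base64, because XML parsers normalise a literal CR to LF.

```python
import base64
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Anything outside this set cannot be written literally in XML 1.0 and
# survive a round trip. CR (U+000D) is left out on purpose, see above.
_UNSAFE = re.compile(
    r"[^\x09\x0A\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def encode_field(tag: str, value: str) -> str:
    """Emit one field, falling back to base64 only when it is needed."""
    if _UNSAFE.search(value):
        payload = base64.b64encode(value.encode("utf-8")).decode("ascii")
        return f'<{tag} enc="base64">{payload}</{tag}>'
    return f"<{tag}>{escape(value)}</{tag}>"


def decode_field(xml_text: str) -> str:
    """Read a field back, accepting either representation."""
    elem = ET.fromstring(xml_text)
    text = elem.text or ""
    if elem.get("enc") == "base64":
        return base64.b64decode(text).decode("utf-8")
    return text


print(encode_field("first", "John"))       # <first>John</first>
print(decode_field(encode_field("note", "bell\x07")) == "bell\x07")  # True
```

Both the generator and the parser carry the extra branch, which is the
ugliness the paragraph above complains about.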

Mark

—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; "Lars Marius Garshol" <[EMAIL PROTECTED]>;
"Martin Duerst" <[EMAIL PROTECTED]>
Sent: Tuesday, July 17, 2001 11:10
Subject: Re: Is there Unicode mail out there?

> Mark Davis wrote:
>
> > I had been told by the W3C people that the reason for forbidding control
> > characters in XML and HTML was for compatibility with SGML.
>
>
> More accurately, with the SGML default syntax, which is used in HTML
> and (with a few modifications) in XML.
>
>
> > When you are thinking of XML as a general transmission mechanism for data
> > (not just a text document) it becomes clear. Suppose that you have a
> > database, of any sort. Some fields may or may not contain control
> > characters -- since control characters are perfectly legal in many if not
> > all databases. You want to query that database and get a selection, packaged
> > as XML.
>
>
> In that case the content of the field is not text but an octet string,
> and you need to do something different, like base64-ing it.
>
> --
> There is / one art || John Cowan <[EMAIL PROTECTED]>
> no more / no less  || http://www.reutershealth.com
> to do / all things || http://www.ccil.org/~cowan
> with art- / lessness   \\ -- Piet Hein
>
>

Re: Is there Unicode mail out there?

2001-07-17 Thread Mark Davis


I had been told by the W3C people that the reason for forbidding control
characters in XML and HTML was for compatibility with SGML. I've never
checked it, since unfortunately the SGML standard is not online. If not
true, that's very interesting.

When you are thinking of XML as a general transmission mechanism for data
(not just a text document) it becomes clear. Suppose that you have a
database, of any sort. Some fields may or may not contain control
characters -- since control characters are perfectly legal in many if not
all databases. You want to query that database and get a selection, packaged
as XML.

Unfortunately, you have to invent your own home-brew quoting mechanism for
the control characters, since the standard XML does not permit you to
represent all of the -- perfectly valid -- characters in that database. And
such a home-brew mechanism will not interwork with anything else.

Conversely, you could filter out the control characters. That, of course,
would corrupt the data. Generally considered a bad thing.

Mark

—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "Lars Marius Garshol" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, July 17, 2001 02:28
Subject: Re: Is there Unicode mail out there?


>
> * Mark Davis
> |
> | The HTML spec depends on the SGML spec for a characterization of
> | allowable characters. The latter, unfortunately, disallows some
> | valid Unicode characters (most C0 controls), but inconsistently
> | allows other similar characters (C1 controls).
>
> SGML is silent on the issue of what characters are allowed. It is the
> SGML declaration used by each application which decides this, and you
> can easily make an SGML declaration which allows every Unicode
> character.
>
> To wit:
>
>   CHARSET
>         BASESET  "ISO Registration Number 177//CHARSET
>                   ISO/IEC 10646-1:1993 UCS-4 with
>                   implementation level 3//ESC 2/5 2/15 4/6"
>        DESCSET 0       55296   0
>                55296   2048    UNUSED  -- SURROGATES --
>                57344   1056768 57344
>
>   CAPACITY        SGMLREF
>                   TOTALCAP        150000
>                   GRPCAP          150000
>                   ENTCAP          150000
>
>   SCOPE    DOCUMENT
>   SYNTAX
>        SHUNCHAR NONE
>        BASESET  "ISO 646IRV:1991//CHARSET
>                  International Reference Version
>                  (IRV)//ESC 2/8 4/2"
>        DESCSET  0 128 0
>
>        FUNCTION
>                 RE            13
>                 RS            10
>                 SPACE         32
>                 TAB SEPCHAR    9
>
>        NAMING   LCNMSTRT ""
>                 UCNMSTRT ""
>                 LCNMCHAR ".-_:"
>                 UCNMCHAR ".-_:"
>                 NAMECASE GENERAL YES
>                          ENTITY  NO
>
>        DELIM    GENERAL  SGMLREF
>                 HCRO     "&#38;#x" -- 38 is the number for ampersand --
>                 SHORTREF SGMLREF
>        NAMES    SGMLREF
>        QUANTITY SGMLREF
>                 ATTCNT   60      -- increased --
>                 ATTSPLEN 65536   -- These are the largest values --
>                 LITLEN   65536   -- permitted in the declaration --
>                 NAMELEN  65536   -- Avoid fixed limits in actual --
>                 PILEN    65536   -- implementations of HTML UA's --
>                 TAGLVL   100
>                 TAGLEN   65536
>                 GRPGTCNT 150
>                 GRPCNT   64
>
>   FEATURES
>     MINIMIZE
>       DATATAG  NO
>       OMITTAG  YES
>       RANK     NO
>       SHORTTAG YES
>     LINK
>       SIMPLE   NO
>       IMPLICIT NO
>       EXPLICIT NO
>     OTHER
>       CONCUR   NO
>       SUBDOC   NO
>       FORMAL   YES
>     APPINFO NONE
> >
>
> | That means that it is not possible in HTML (or more importantly, in
> | XML) to represent all valid Unicode characters in data fields.
>
> What would you want to use control characters for in an XML document?
>
> --Lars M.
>
>
>

Re: Is there Unicode mail out there?

2001-07-17 Thread Lars Marius Garshol



* Mark Davis
|
| The HTML spec depends on the SGML spec for a characterization of
| allowable characters. The latter, unfortunately, disallows some
| valid Unicode characters (most C0 controls), but inconsistently
| allows other similar characters (C1 controls). 

SGML is silent on the issue of what characters are allowed. It is the
SGML declaration used by each application which decides this, and you
can easily make an SGML declaration which allows every Unicode
character.

To wit:



| That means that it is not possible in HTML (or more importantly, in
| XML) to represent all valid Unicode characters in data fields.

What would you want to use control characters for in an XML document?

--Lars M.

Re: Is there Unicode mail out there?

2001-07-17 Thread Lars Marius Garshol



* Michael Everson
| 
| Perhaps I have been asleep, but is that notation (&#XXXX;) valid
| HTML for all Unicode characters?

The numeric character reference syntax is defined by SGML, and just
referenced by HTML, and in SGML it is defined in terms of the document
character set, which is defined by the SGML declaration used by each
SGML application (of which HTML is one instance).

The numeric character reference syntax can be used to refer to any
character in the document character set (as declared by the SGML
declaration used by HTML[1]). The document character set used by HTML
is Unicode, but some characters have been disallowed, and may not
appear in documents, whether directly or by reference. These are

 U+0000 - U+0008
 U+000B - U+000C
 U+000E - U+001F
 U+007F - U+009F
 U+D800 - U+DFFF 

--Lars M.

[1] http://www.w3.org/TR/html401/sgml/sgmldecl.html

Re: Is there Unicode mail out there?

2001-07-17 Thread Lars Marius Garshol



* Tex Texin
|
| XML restricts the character set which by implication restricts the
| NCR values. I see that > can't use an NCR but < can.

They can both use NCRs. In fact, the example definitions of the
predefined entities do just that:

  http://www.w3.org/TR/REC-xml#sec-predefined-ent

--Lars M.

[OT] Loco (was Re: FW: Re: Is there Unicode mail out there?)

2001-07-16 Thread Edward Cherlin

At 10:58 AM 2001-07-13, =?ISO-2022-JP?B?GyRCJEYkcyRJJCYkaiRlJCYkOBsoQg==?= 
wrote:
>1) I think that is mojibake for my name. It looks familiar.

I see it a lot too.

>2) The second one reads, if I rightly remember, "Watashi wa loco en la 
>cabeza".

In the 1960's Anne Dinken's Restaurant Kosher in Roppongi advertised itself 
"Yiddish hanashimasu" and "Aqui se habla Yiddish". BTW Anne would have 
agreed with you.

Edward Cherlin
Generalist
"A knot! Oh, do let me help to undo it."
Alice in Wonderland

Re: Is there Unicode mail out there?

2001-07-16 Thread Mark Davis


The HTML spec depends on the SGML spec for a characterization of allowable
characters. The latter, unfortunately, disallows some valid Unicode
characters (most C0 controls), but inconsistently allows other similar
characters (C1 controls). That means that it is not possible in HTML (or
more importantly, in XML) to represent all valid Unicode characters in data
fields.

Mark

—

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

Re: Is there Unicode mail out there?

2001-07-16 Thread Shigemichi Yazawa

At Sat, 14 Jul 2001 09:49:30 -0700,
Mark Davis <[EMAIL PROTECTED]> wrote:
> 
> No, but it is for the vast majority.
> 
> Some have to be written specially, e.g. <

I looked at XML 1.0 spec and it says in 2.4 Character Data and Markup
that

"If they are needed elsewhere, they must be escaped using either
numeric character references or the strings "&amp;" and "&lt;"
respectively."

I also looked at HTML 4.01 spec and it doesn't say in 5.3.2 Character
entity references that &#x3C; cannot be used to represent "<".

> Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

This is true for XML, but I couldn't find any statement in HTML 4.01
spec to restrict the use of U+0007 in HTML document.

By the way, I have been pondering why, in XML, all the C1 control
characters are legal but some of the C0 control characters are
not. 2.2 Characters says that "Legal characters are tab, carriage
return, line feed, and the legal characters of Unicode and ISO/IEC
10646." and the BNF for Char is this.

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks,
       FFFE, and FFFF. */

Does this mean C0 controls are not legal Unicode characters?

---
Shigemichi Yazawa
[EMAIL PROTECTED]
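
The asymmetry raised in the message above can be checked mechanically.
The following Python sketch transcribes production [2] Char from XML 1.0
section 2.2 and lists which C0 and C1 controls it admits. The function
name is_xml10_char is an assumption chosen for this illustration.

```python
def is_xml10_char(cp: int) -> bool:
    """True if cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# C0 controls are U+0000..U+001F, C1 controls are U+0080..U+009F.
c0_allowed = [cp for cp in range(0x00, 0x20) if is_xml10_char(cp)]
c1_allowed = [cp for cp in range(0x80, 0xA0) if is_xml10_char(cp)]

print(c0_allowed)        # [9, 10, 13]
print(len(c1_allowed))   # 32
```

Only tab, line feed, and carriage return pass among the C0 controls,
while all 32 C1 controls fall inside the [#x20-#xD7FF] range.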

Re: Is there Unicode mail out there?

2001-07-15 Thread Mark Davis


yes
----- Original Message -----
From: "Christopher J Fynn" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>
Sent: Saturday, July 14, 2001 22:57
Subject: RE: Is there Unicode mail out there?


>
> Mark Davies wrote:
>
> <<
> Take a look at the XML standard.
>
> Mark
> >>
>
> The thread was discussing HTML. Are there any restrictions on numeric
> character references in the *HTML* standard?
>
> - Chris
>
>
>
>
>

Re: Is there Unicode mail out there?

2001-07-14 Thread Tex Texin


Mark,
ok thanks. XML restricts the character set which by implication
restricts the NCR values. I see that > can't use an NCR but <
can.

tex

Mark Davis wrote:
> 
> Take a look at the XML standard.
> 
> Mark
> ----- Original Message -----
> From: "Tex Texin" <[EMAIL PROTECTED]>
> > Hi. I am not sure why you say this. &lt; is often used for "<"
> > but &#x3C; works in both IE 5 and Netscape 4.7.
> >
> > &#x7; shows a box though...
> >
> > But I was not aware of any restrictions on numeric character
> > references. Is there a list of restrictions somewhere?
> > tex

> > Mark Davis wrote:
> > > No, but it is for the vast majority.
> > > Some have to be written specially, e.g. <
> > > Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---

RE: Is there Unicode mail out there?

2001-07-14 Thread Christopher J Fynn



Mark Davies wrote:

<< 
Take a look at the XML standard.

Mark
>>

The thread was discussing HTML. Are there any restrictions on numeric character 
references in the *HTML* standard?

- Chris

RE: Is there Unicode mail out there?

2001-07-14 Thread Christopher J Fynn


Gaute B Strokkenes wrote:

<< ...
That's the only benefit that Unicode and UTF-8 will bring to email:
the ability to mix and match characters from all scripts of all sizes
and shapes in a single message.  OTOH, for those of us who need this
it's a big advantage.
>>

There are also a number of scripts which don't have any registered 
encoding or code-page except Unicode / ISO-10646 - for users of those
scripts, whether or not they want to mix characters from other 
scripts, Unicode / UTF-8 is the only real choice (unless they want to 
use some non-standard font based encoding).

However, since many of these scripts are also complex scripts, 
clients need to be able to render them properly to be of much use
with these scripts. 

- Chris

Re: Is there Unicode mail out there?

2001-07-14 Thread Mark Davis


Take a look at the XML standard.

Mark

Re: Is there Unicode mail out there?

2001-07-14 Thread Tex Texin


Mark, 
Hi. I am not sure why you say this. &lt; is often used for "<"
but &#x3C; works in both IE 5 and Netscape 4.7.

&#x7; shows a box though...

But I was not aware of any restrictions on numeric character
references. Is there a list of restrictions somewhere?
tex


Mark Davis wrote:
> 
> No, but it is for the vast majority.
> 
> Some have to be written specially, e.g. <
> 
> Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)
> 
> Mark
> ----- Original Message -----
> From: "Michael Everson" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Saturday, July 14, 2001 05:10
> Subject: Re: Is there Unicode mail out there?
> 
> > At 11:07 -0400 2001-07-13, Tex Texin wrote:
> >
> > >Maybe writing the value as an HTML numeric character reference (e.g.
> > >&#x20AC;) would also make it easier for processes reading files
> > >saved by the mailer
> > >to recover the character.
> >
> > Perhaps I have been asleep, but is that notation (&#XXXX;) valid
> > HTML for all Unicode characters?
> > --
> > Michael Everson

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---

Re: Is there Unicode mail out there?

2001-07-14 Thread Gaute B Strokkenes

On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote:
>> > How about just supporting these: ISO646-PT, ISO10646-UTF-1,
>> > NATS-SEFI and HP-DeskTop?
>>
>> I'm not sure what you're trying to say here.  Assuming these are
>> properly registered charsets, it seems like a very narrow range to
>> support.
> 
> Maybe "supporting at least these" would have been a better
> phrasing. They're all valid and registered MIME-charsets. Do you
> know of a single mailer that supports all 4?

OK, I get your point.  There are a lot of obscure charsets out there,
and it's probably not necessary to make sure that mail clients
understand all of them since a lot of these have no precedent for use
in email.  Nevertheless, there are a number of charsets--ISO-8859-1,
ISO-8859-2, KOI8-R, Shift_JIS and so on--that have widespread
precedent for use in email, and are de-facto standards for email in
certain languages.  It would be extremely foolish to implement a mail
client that understands UTF-8 but not these.

>> If we all had to upgrade our software to do so, I think a lot of
>> people just wouldn't bother.
> 
> You're claiming on one hand that everyone's mailer should handle all
> sorts of charsets, and on the other using one that doesn't support
> the only charset that is RFC-mandated for a working mail program to
> support.

I'm sorry, but you're mixing things up a bit.  Keep in mind that in
general there is a difference between what processes implementing
Internet protocols should generate and what they are required to
accept.  One of the principles that the Internet is founded on is to
"be liberal in what you accept, and conservative in what you produce".

> (Yes, a mailer that doesn't handle UTF-8 violates the appropriate
> RFCs.)

Chapter and verse, please?  The only document I could find that puts
forth such a requirement is the one at:

  http://www.imc.org/mail-i18n.html

which is not an RFC.  Other than that, there is RFC 2277; however this
only states that protocols must make it possible to exchange textual
data using UTF-8; it doesn't make it mandatory to understand UTF-8.

RFC 2049 only states that US-ASCII must be understood, and the same
for the ISO-8859-X charsets, except that you're not required to be
able to display the non-ASCII characters they contain.  There's no
mention of UTF-8.

If you have any better references, please provide them.  (I do not
claim to have encyclopedic knowledge of the subject.)

Note that the IMC document does not encourage mail clients to produce
UTF-8 by default, it only states that mail clients should be able to
interpret it and give users the option to create messages in UTF-8.
It explicitly recognises that few mail clients implemented good
UTF-8 support at the time.  That was three years ago, and little has
changed since.  It is only very recently that good UTF-8 support has
become standard for new clients, and there are still lots and lots of
old clients that have no UTF-8 support at all.  It is certainly clear
that the time scale hinted at in the document (that all mail clients
created or revised after 1 January 1999 should be able to interpret
UTF-8) was hopelessly optimistic.  We're not there yet, even though
we're getting closer.

>> It's the closest thing that we have to a common _universal_
>> charset.
> 
> You sure? Besides ASCII, what other charset can almost everyone read
> (including the people who cut and paste into Unicode editors,
> because they can read it)? There's no other charset (besides ASCII)
> that everyone with a working mailer, no matter how minimal, can
> read.

Well, I'm saying that UTF-8 / Unicode is the closest thing that we
have to a universal charset.  (I meant "universal" as in "universal
character repertoire", not "universally supported".)  There are many
charsets that are better supported in general than UTF-8; ASCII and
ISO-8859-1 are two of them.

However, the problem in question is not to choose the "best" charset
in general, but to choose the best possible charset for a given
message containing a given set of characters.  RFC 2046 states:

   More generally, if a widely-used character set is a subset of
   another character set, and a body contains only characters in the
   widely-used subset, it should be labelled as being in that subset.
   This will increase the chances that the recipient will be able to
   view the resulting entity correctly.

I think this is good advice.  Consider the scenario where a group of
people are accustomed to exchanging email in the language of their
choice in a particular charset with little difficulty.  Then some
members of the group upgrade their software, and the other members of
the group can then no longer read their messages, since the new
software insists on using UTF-8 (which the older software does not
support).  That's bad, and the above advice avoids this situation.
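
The RFC 2046 advice quoted above amounts to labelling a message with the
narrowest widely-used charset that can hold it. A minimal Python sketch,
assuming a hypothetical helper pick_charset and a preference list that a
real mailer would tailor to its users:

```python
def pick_charset(text, candidates=("us-ascii", "iso-8859-1", "utf-8")):
    """Return the first charset in candidates that can encode text."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset lacks a needed character, try the next
        return charset
    raise ValueError("no candidate charset can represent this text")


print(pick_charset("plain text"))         # us-ascii
print(pick_charset("Gr\u00fc\u00dfe"))    # iso-8859-1
print(pick_charset("\u03c0\u03ac\u03bd\u03c4\u03c9\u03bd"))  # utf-8
```

With this ordering UTF-8 is used only when no narrower charset fits, so
recipients on older software keep seeing the labels they already handle.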

-- 
Gaute Strokkenes        http://www.srcf.ucam.org/~gs234/
I'm thinking about DIGITAL READ-OUT systems and
 computer-gene

Re: Is there Unicode mail out there?

2001-07-14 Thread David Starner

From: Gaute B Strokkenes <[EMAIL PROTECTED]>
> On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote:
> > From: Gaute B Strokkenes <[EMAIL PROTECTED]>
> >> No way.  Any mail client that is sufficiently clever to understand
> >> UTF-8 should understand all valid and registered MIME-charsets.
> >> After all, conversion libraries are both widely available and easy
> >> to use.
> >
> > Do you know of any that actually do?
>
> Actually do convert messages in arbitrary charsets to UTF-8 / Unicode,
> you mean?

No, I mean "understand all valid and registered MIME-charsets".

> > How about just supporting these: ISO646-PT, ISO10646-UTF-1,
> > NATS-SEFI and HP-DeskTop?
>
> I'm not sure what you're trying to say here.  Assuming these are
> properly registered charsets, it seems like a very narrow range to
> support.

Maybe "supporting at least these" would have been a better phrasing. They're
all valid and registered MIME-charsets. Do you know of a single mailer that
supports all 4?

> If we all had to upgrade
> our software to do so, I think a lot of people just wouldn't bother.

You're claiming on one hand that everyone's mailer should handle all sorts
of charsets, and on the other using one that doesn't support the only
charset that is RFC-mandated for a working mail program to support. (Yes, a
mailer that doesn't handle UTF-8 violates the appropriate RFCs.)

> It's the closest thing that we have to a common _universal_ charset.

You sure? Besides ASCII, what other charset can almost everyone read
(including the people who cut and paste into Unicode editors, because they
can read it)? There's no other charset (besides ASCII) that everyone with a
working mailer, no matter how minimal, can read.

--
David Starner - [EMAIL PROTECTED]

Re: Is there Unicode mail out there?

2001-07-14 Thread G. Adam Stanislav


At 12:03 2001-07-13 EDT, [EMAIL PROTECTED] wrote:
>Unfortunately, the Windows world has no concept of a Last Resort font.  It 
>would certainly seem to be a useful solution in cases like this.

Does a PostScript, Type 1, version of such a font exist for
download somewhere?

Adam
--- 
http://phonecowboy.com/registrar/twist/ finds a good domain for you
and checks for its existence.

Re: Is there Unicode mail out there?

2001-07-14 Thread Michael \(michka\) Kaplan


From: "Michael Everson" <[EMAIL PROTECTED]>

> Then it's not standard and can't be relied upon. Pity.

Actually, it is a standard, as of HTML 4.0. All you need is a compliant
browser.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

Re: Is there Unicode mail out there?

2001-07-14 Thread Michael \(michka\) Kaplan



michka

the only book on internationalization in VB at
http://www.i18nWithVB.com/


Re: Is there Unicode mail out there?

2001-07-14 Thread Michael Everson


At 09:49 -0700 2001-07-14, Mark Davis wrote:

>  > >Maybe writing the value as an HTML numeric character reference (e.g.
>  > >&#x20AC;) would also make it easier for processes reading files
>  > >saved by the mailer
>  > >to recover the character.
>  >
>  > Perhaps I have been asleep, but is that notation (&#XXXX;) valid
>  > HTML for all Unicode characters?
>
>No, but it is for the vast majority.
>
>Some have to be written specially, e.g. <
>
>Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

Then it's not standard and can't be relied upon. Pity.
-- 
Michael Everson

Re: Is there Unicode mail out there?

2001-07-14 Thread Daniel Biddle


On Sat, Jul 14, 2001 at 01:10:15PM +0100, Michael Everson wrote:
> At 11:07 -0400 2001-07-13, Tex Texin wrote:
> 
> >Maybe writing the value as an HTML numeric character reference (e.g.
> >&#x20AC;) would also make it easier for processes reading files
> >saved by the mailer
> >to recover the character.
>
> Perhaps I have been asleep, but is that notation (&#XXXX;) valid
> HTML for all Unicode characters?

Since HTML 4, yes: http://www.w3.org/TR/html4/charset.html#h-5.3.1
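
For illustration, writing text with hexadecimal numeric character
references and recovering it again could be sketched in Python as below.
The helper name to_ncr is an assumption; html.escape and html.unescape
come from the Python standard library.

```python
import html


def to_ncr(text: str) -> str:
    """Escape markup characters, then write non-ASCII as &#xHHHH; references."""
    escaped = html.escape(text, quote=False)  # handles &, < and >
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in escaped
    )


encoded = to_ncr("1 < 2 \u20ac")
print(encoded)                  # 1 &lt; 2 &#x20AC;
print(html.unescape(encoded))   # 1 < 2 followed by the euro sign
```

The result is pure ASCII, so a process reading a saved file can recover
the original characters without knowing the mailer's charset.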

-- 
Daniel Biddle <[EMAIL PROTECTED]>

Re: Is there Unicode mail out there?

2001-07-14 Thread Mark Davis


No, but it is for the vast majority.

Some have to be written specially, e.g. <

Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)

Mark

Re: Is there Unicode mail out there?

2001-07-14 Thread Michael Everson


At 11:07 -0400 2001-07-13, Tex Texin wrote:

>Maybe writing the value as an HTML numeric character reference (e.g.
>&#x20AC;) would also make it easier for processes reading files
>saved by the mailer
>to recover the character.

Perhaps I have been asleep, but is that notation (&#XXXX;) valid
HTML for all Unicode characters?
-- 
Michael Everson

Re: Is there Unicode mail out there?

2001-07-14 Thread Gaute B Strokkenes

On Sat, 14 Jul 2001, [EMAIL PROTECTED] wrote:
> From: Gaute B Strokkenes <[EMAIL PROTECTED]>
>> No way.  Any mail client that is sufficiently clever to understand
>> UTF-8 should understand all valid and registered MIME-charsets.
>> After all, conversion libraries are both widely available and easy
>> to use.
> 
> Do you know of any that actually do?

Actually do convert messages in arbitrary charsets to UTF-8 / Unicode,
you mean?  Any reasonably modern mail client will.  IIRC Microsoft OE
and friends do everything in Unicode internally and only convert to
other encodings when receiving or sending mail.  (Though OE is broken
in so many other ways that I wouldn't recommend it.)  Gnus/Emacs does
too (actually it uses the Emacs MULE encoding internally, but from the
user's perspective the effect is precisely the same).

My argument is based on the fact that if you have put in the necessary
work to interpret UTF-8 messages, then it does not take at all that
much extra effort to interpret messages in other charsets by running
them through a converter first.  I postulate that libraries to perform
this function are both widely available and highly portable; if you do
not agree then I would be happy to point out concrete examples.
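
As one concrete illustration (a sketch only, using Python's standard codecs module as the conversion library; `to_utf8` is a made-up helper name):

```python
import codecs


def to_utf8(raw: bytes, mime_charset: str) -> bytes:
    """Convert a message body from its labelled charset to UTF-8."""
    # codecs.lookup raises LookupError when the label is not known locally.
    codec = codecs.lookup(mime_charset)
    text = raw.decode(codec.name)
    return text.encode("utf-8")
```

The conversion itself is a couple of lines; what varies between libraries is which registered labels they actually ship. A label the local library lacks fails at the lookup step rather than producing garbage.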

> How about just supporting these: ISO646-PT, ISO10646-UTF-1,
> NATS-SEFI and HP-DeskTop?

I'm not sure what you're trying to say here.  Assuming these are
properly registered charsets, it seems like a very narrow range to
support.  If they're not, then they have no place in email whatsoever
(and UTF-8 is clearly a better choice.)

> I don't think anyone was suggesting that for all lists. However,
> here, on the Unicode list, everyone on the list should be able to
> handle Unicode, and those who can have sometimes been willing to cut
> and paste into a Unicode editor just to see what's up.

I don't think that holds.  People on the unicode list are not
necessarily Unicode boffins, although a lot of the active people are.
Some of us are just here because we have an interest in, say, i18n in
general and like to keep an eye on things.  If we all had to upgrade
our software to do so, I think a lot of people just wouldn't bother.
That way, everyone loses.

Note that I think it is appropriate to use UTF-8 when there's just no
common charset that can represent a given message.

> Legacy encodings should be used when you're communicating with
> people who use legacy encodings and legacy mail readers.  Unicode
> people don't - after ASCII, UTF-8 is probably the closest thing we
> have to a common usable encoding.

It's the closest thing that we have to a common _universal_ charset.
For messages that do not require the `universal' property, there are
many charsets that are just as sensible and, more to the point, much
better supported.

-- 
Gaute Strokkenes        http://www.srcf.ucam.org/~gs234/
Yow!  Am I in Milwaukee?

Re: Is there Unicode mail out there?

2001-07-14 Thread David Starner


From: Gaute B Strokkenes <[EMAIL PROTECTED]>
> No way.  Any mail client that is sufficiently clever to understand
> UTF-8 should understand all valid and registered MIME-charsets.  After
> all, conversion libraries are both widely available and easy to use.

Do you know of any that actually do? How about just supporting these:
ISO646-PT, ISO10646-UTF-1, NATS-SEFI and HP-DeskTop?

> All the `all messages should be in UTF-8, even when there are
> well-established legacy encodings that cover the characters of a given
> message' mumbo-jumbo that has been mentioned recently on the list is
> really just so much hot air.

I don't think anyone was suggesting that for all lists. However, here, on
the Unicode list, everyone on the list should be able to handle Unicode, and
those who can have sometimes been willing to cut and paste into a Unicode
editor just to see what's up. Legacy encodings should be used when you're
communicating with people who use legacy encodings and legacy mail readers.
Unicode people don't - after ASCII, UTF-8 is probably the closest thing we
have to a common usable encoding.

--
David Starner - [EMAIL PROTECTED]

Re: Is there Unicode mail out there?

2001-07-13 Thread Gaute B Strokkenes

On Fri, 13 Jul 2001, [EMAIL PROTECTED] wrote:
> 
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
> 
>> Those are MOJIBAKE for my SIG.
> 
>   Which is what you deserve for not sending UTF-8.  Until you
> upgrade your mailer, your name will be 
>"?@?š‚¶‚ã‚¤‚¢‚Á‚¿‚á‚ñ?š".  :-p

No way.  Any mail client that is sufficiently clever to understand
UTF-8 should understand all valid and registered MIME-charsets.  After
all, conversion libraries are both widely available and easy to use.
[I can see you put a smiley after your statement so I realise you were
probably being sarcastic, but I thought that this could bear pointing
out.]

All the `all messages should be in UTF-8, even when there are
well-established legacy encodings that cover the characters of a given
message' mumbo-jumbo that has been mentioned recently on the list is
really just so much hot air.  Firstly, mail clients will not be able
to deprecate support for other charsets even if UTF-8 is widely
adopted (which it isn't--for email) because of the need to be able to
interpret the masses of existing messages.  Secondly, maintaining such
support is, as pointed out above, extremely easy to do.  Thirdly,
there are a great number of clients out there that do not support
UTF-8 and are unlikely to do so in the immediate future, either
because of internal limitations in the software that are hard to
remove or because people don't upgrade.  I think it's antisocial to
say `Well, I _could_ have used a charset that would have enabled you
to read my message but I decided not to, for no particularly good
reason.'

On the other hand it makes sense to say `Sorry, but UTF-8 is the only
charset that will do since I wanted to use Etruscan, Russian and
Japanese characters and UTF-8 is the only sane way to do this.'
That's the only benefit that Unicode and UTF-8 will bring to email:
the ability to mix and match characters from all scripts of all sizes
and shapes in a single message.  OTOH, for those of us who need this
it's a big advantage.

Another thing that some people may worry about is the bad interaction
between quoted-unprintable and UTF-8 (or any non-West European / North
American coding in general, but for UTF-8 it's even worse): 6 bytes
for a single Cyrillic character?  Ye gods.  [I could start another
rant about how bad an idea QP was in the first place, but that's
off-topic here.]
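
The arithmetic is easy to check. A small sketch with Python's standard quopri module (`qp_size` is a made-up helper; KOI8-R is used here only as an example of a legacy Cyrillic charset):

```python
import quopri


def qp_size(ch: str, charset: str) -> int:
    """Number of bytes one character occupies after quoted-printable encoding."""
    return len(quopri.encodestring(ch.encode(charset)))


# A Cyrillic letter is two bytes in UTF-8 and each byte becomes "=XX".
print(qp_size("\u042f", "utf-8"))   # CYRILLIC CAPITAL LETTER YA in UTF-8
print(qp_size("\u042f", "koi8-r"))  # the same letter in a one-byte charset
```

Every non-ASCII byte triples in size, so a two-byte UTF-8 sequence costs six bytes on the wire where a one-byte legacy charset costs three.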

-- 
Gaute Strokkenes        http://www.srcf.ucam.org/~gs234/
I am NOT a nut

RE: Re: Is there Unicode mail out there?

2001-07-13 Thread Kenneth Whistler


Mike Ayers challenged:

> Here's some for you to transliterate:
>
> 1.)  The bull's nose ring is where we attach the taurine towline.

Easy:

za buuruzu noozu ringu izu fuea ui ataatchu za toorin toorainu.

Looks fine to me!

--Ken

Re: Is there Unicode mail out there?

2001-07-13 Thread James Kass


Rick McGowan wrote:

> Eeek.. What's that?  11's comment shows up fine
> in my mail reader here, as Japanese chars.  But
> what I got was, I believe, "watashi wa rokoenrakabesa"
> which isn't any Japanese that I can parse, and it
> should have a comma after "wa" in any case.
> "Roko" isn't a word, though "rouko" and "roukou"
> are (and don't make sense here).  "Besa" isn't a verb
> ending, even in classical Japanese, and I can't imagine
> what it's supposed to mean.  "Enraka" isn't a word,
> and "koen" isn't a word though "kouen" is...  Hm.
> It's gibberish anyway, so it wouldn't matter if it
> came through.

How's your Spanish, Rick?

Try "watashi wa" as Japanese and "roko en ra kabesa"
as Spanish...  (keeping in mind that Japanese doesn't
distinguish between "r" and "l", of course.)

Best regards,

James Kass.

RE: Re: Is there Unicode mail out there?

2001-07-13 Thread Ayers, Mike



> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 

> Those are MOJIBAKE for my SIG.

Which is what you deserve for not sending UTF-8.  Until you upgrade
your mailer, your name will be "@š‚¶‚ã‚¤‚¢‚Á‚¿‚á‚ñš".   
 :-p

> 1) I think that is mojibake for my name. It looks familiar.

See above.

> 2) The second one reads, if I rightly remember, "Watashi wa 
> loco en la cabeza".

Yep: 私はろこえんらかべさ。You're still hung up on this "use kana to represent
any language" thing, huh?  You've got that in common with the Japanese - I
was quite surprised to find that most Japanese don't know that their
katakana versions of English words don't sound much like English words.
Anyway, if I ever meet a Spanish and Japanese fluent individual, I'll wave
it under their nose to see if they catch it.  They won't, though, since
you're using hiragana instead of katakana.

Here's some for you to transliterate:

1.)  The bull's nose ring is where we attach the taurine
towline.
2.)  Raul studies lore.
3.)  My file said he was vile.
4.)  Fu did that.  Fu who?

Etc., etc., etc.

> If I get a mojibakus or two in a Chinese sig, I don't say 
> anything. (Is mojibakus the singular of mojibake? Perhaps 
> "mojibakum"?)

You're the Japanese enthusiast - look it up!


/|/|ike

Re: FW: Re: Is there Unicode mail out there?

2001-07-13 Thread Rick McGowan


> Watashi wa loco en la cabeza

Duh, well, use katakana as appropriate, use middle-dots between your foreign  
words, and people might get it.

Rick

FW: Re: Is there Unicode mail out there?

2001-07-13 Thread

Those are MOJIBAKE for my SIG.

1) I think that is mojibake for my name. It looks familiar.

2) The second one reads, if I rightly remember, "Watashi wa loco en la cabeza".


If I get a mojibakus or two in a Chinese sig, I don't say anything. (Is mojibakus the 
singular of mojibake? Perhaps "mojibakum"?)


じゅういっちゃん

--- Original Message ---
From: [EMAIL PROTECTED];
To: [EMAIL PROTECTED];
Cc: [EMAIL PROTECTED];
Date: 01/07/13 15:29
Subject: Re: Is there Unicode mail out there?

>In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, [EMAIL PROTECTED] 
>writes:
>
>>  $B%D!!%D!&!#c`TD%+c`TE!Wc`TD!"c`TD!Vc`TE"d?TD%=c`TE!#c`TE%"%D!&!#(B
>>
>>  $B%D!!%J%9c`\d?TE:d?TE%)c`TD%"c`TD%rc`TE%"c`TE%!c`TD%%c`TENd?TD%&%D!#(B
>
>Robert, please stop this.  It doesn't seem to be UTF-8 (that is, I can't copy 
>and paste it into UniPad or Windows 2000 Notepad and see anything 
>reasonable), and even if it were, neither I nor many other list members can 
>read Japanese.  We had this discussion earlier in the year about English vs. 
>French, and other than exceptions like Patrick Andries' message (which was 
>explicitly about a French translation), this is basically an English-language 
>list.  It is certainly cool to ask questions about this or that Japanese 
>character, but simply posting an unreadable Japanese response to my 
>English-language message makes no sense.
>
>-Doug Ewell
> Fullerton, California
>
>

Re: Is there Unicode mail out there?

2001-07-13 Thread Rick McGowan


Doug Ewell wrote...

> >  @š‚¶‚ã‚¤‚¢‚Á‚¿‚á‚ñš
> >  @Ž„‚Í‚ë‚±‚¦‚ñ‚ç‚©‚×‚³B
>
> Robert, please stop this.  It doesn't seem to be UTF-8 (that is, I can't copy  
> and paste it into UniPad or Windows 2000 Notepad and see anything
> reasonable)

Eeek.. What's that?  11's comment shows up fine in my mail reader here, as  
Japanese chars.  But what I got was, I believe, "watashi wa rokoenrakabesa"  
which isn't any Japanese that I can parse, and it should have a comma after  
"wa" in any case.  "Roko" isn't a word, though "rouko" and "roukou" are (and  
don't make sense here).  "Besa" isn't a verb ending, even in classical  
Japanese, and I can't imagine what it's supposed to mean.  "Enraka" isn't a  
word, and "koen" isn't a word though "kouen" is...  Hm.  It's gibberish  
anyway, so it wouldn't matter if it came through.

Just looks like nearly random syllables generated by someone who doesn't  
write the language.

Rick

RE: Is there Unicode mail out there?

2001-07-13 Thread Ayers, Mike

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 

> In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, 
> [EMAIL PROTECTED] 
> writes:
> 
> >  @š‚¶‚ã‚¤‚¢‚Á‚¿‚á‚ñš
> >
> >  @Ž„‚Í‚ë‚±‚¦‚ñ‚ç‚©‚×‚³B
> 
> Robert, please stop this.  It doesn't seem to be UTF-8 (that 
> is, I can't copy 
> and paste it into UniPad or Windows 2000 Notepad and see anything 

It's ISO-2022-JP, if that helps.
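
For the curious, the garbage quoted above can be reproduced. This is only a guess from the bytes: it assumes the Japanese text ended up as Shift_JIS somewhere along the way and was then displayed as Windows-1252, with bytes that have no Windows-1252 meaning silently dropped. The function name is made up.

```python
def as_seen_by_latin_reader(text: str) -> str:
    """Show Japanese text the way a Windows-1252-only reader might display it."""
    # Assumption: the body arrives as Shift_JIS bytes...
    raw = text.encode("shift_jis")
    # ...and is rendered as Windows-1252, dropping undefined bytes such as 0x81.
    return raw.decode("cp1252", errors="ignore")


# Ideographic space, star, the hiragana signature, star.
sig = "\u3000\u2605\u3058\u3085\u3046\u3044\u3063\u3061\u3083\u3093\u2605"
print(as_seen_by_latin_reader(sig))
```

Because some bytes are dropped on display, the garbled form cannot be turned back into the original text reliably.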

> character, but simply posting an unreadable Japanese response to my 
> English-language message makes no sense.

Ever think that maybe that's why he does it?  Anyway, here's a hint.
As someone who can read a little Japanese, I have never translated anything
in one of 11DB's messages that really mattered.  Anything that he wants us
to see is put in English, so you can probably safely ignore the question
marks.

On the other hand...

Yo, 11DB, get with the program!  Use UTF-8: where do ya think ya
are?  The point is to confuse people, not frustrate them.   ;-)

/|/|ike

Re: Is there Unicode mail out there?

2001-07-13 Thread DougEwell2

In a message dated 2001-07-13 4:06:39 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  For "ordinary users", i. e., those users who don't have the TUS 3.0
>  tome lying next to their computers, a "last resort glyph" would
>  probably be more helpful, cf. 
>  and .

Unfortunately, the Windows world has no concept of a Last Resort font.  It 
would certainly seem to be a useful solution in cases like this.

-Doug Ewell
 Fullerton, California

Re: Is there Unicode mail out there?

2001-07-13 Thread DougEwell2

In a message dated 2001-07-13 5:27:41 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:

>  @š‚¶‚ã‚¤‚¢‚Á‚¿‚á‚ñš
>
>  @Ž„‚Í‚ë‚±‚¦‚ñ‚ç‚©‚×‚³B

Robert, please stop this.  It doesn't seem to be UTF-8 (that is, I can't copy 
and paste it into UniPad or Windows 2000 Notepad and see anything 
reasonable), and even if it were, neither I nor many other list members can 
read Japanese.  We had this discussion earlier in the year about English vs. 
French, and other than exceptions like Patrick Andries' message (which was 
explicitly about a French translation), this is basically an English-language 
list.  It is certainly cool to ask questions about this or that Japanese 
character, but simply posting an unreadable Japanese response to my 
English-language message makes no sense.

-Doug Ewell
 Fullerton, California

Re: Is there Unicode mail out there?

2001-07-13 Thread Tex Texin

Doug,
I thought I had acknowledged the rationale for supporting labeling the
message with the minimal charset based on each message's contents in
the beginning of the third paragraph, but maybe I should have expanded
on it. Anyway, despite the benefit, it is a significant problem that it
is unreliable and that "past performance does not predict future
performance" or whatever the phrase is that the financial markets use.

I was mostly stage setting for the idea that there should be a clear
indicator for a failed character conversion. The last resort proposal
is OK. I agree with you about seeing the hex value for the missing
character with the symbol. (I've already been forced to learn the
Unicode code point for the Euro by heart... I would probably recognize
most of the commonly failed characters if the code points were
available.) Maybe writing the value as an HTML numeric character
reference (e.g. &#x20AC;) would also make it easier for processes
reading files saved by the mailer to recover the character. (By using a
"standard representation" and also one that is not likely to appear in
an email, unless the email is about character references...)
For the Unicode-unaware, the syntax could allow inclusion of the
original code page label: &#x20AC:windows1256;
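
A minimal sketch of the plain numeric-reference variant, using Python's built-in "xmlcharrefreplace" error handler and html.unescape. The function names are made up, Python emits the decimal form of the reference, and the code-page-label extension is not covered.

```python
import html


def to_ascii_with_ncrs(text: str) -> bytes:
    """Write text as ASCII, turning every other character into an NCR."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def recover(saved: bytes) -> str:
    """Read such a file back, restoring the referenced characters."""
    return html.unescape(saved.decode("ascii"))
```

One caveat with this approach: text that already contains a literal "&#...;" sequence would be changed on recovery, which is the "unless the email is about character references" case.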

Anyway, this problem that characters that do not convert in mails
are not being clearly indicated:
occurs frequently, 
can have significant impact to users,
seems to have some cheap workarounds,
that are better than either just relabeling to the lowest common
denominator or 
preventing communications entirely.

tex

[EMAIL PROTECTED] wrote:
> 
> In a message dated 2001-07-12 8:55:07 Pacific Daylight Time,
> [EMAIL PROTECTED] writes:
> 
> >  So the proposal is that minimizing the charset is a good thing?
> >
> >  This means that you and I start out in a conversation about a
> >  product I am trying to sell you, it happens to be all in ascii
> >  and we exchange several mails successfully. Then I quote you
> >  a price in Euros and my 1252 message gets corrupted by your
> >  reader which can handle either only 8859-1 or ASCII, and
> >  you miss the fact that the Euro is corrupted and think we
> >  are talking dollars or some other currency.
> >
> >  Although I understand why you would want a minimal charset in order
> >  to not needlessly prevent communications, the implication of
> >  reliability and trust that is built by having some success is
> >  a problem. You think you are communicating successfully but when it
> >  is critical it may not...
> 
> The premise seems to be that we should reject, or at least issue a warning
> against, the earlier messages on the basis that the sender *might* be able to
> send characters in the future that the receiver could not receive.  Sorry,
> but I can't buy into that.  That would prevent the CP1252 user from ever
> being able to communicate adequately with anyone who has "only" ISO 8859-1.
> 
> What if I am trying to exchange mail with a user of Windows-1256?  Lots of
> roadblocks would be erected because of the chance that the guy *might* send
> me ARABIC LETTER ALEF WITH HAMZA BELOW and I couldn't interpret it.  And I
> couldn't exchange mail with UTF-8 users either, because of that YI SYLLABLE
> BBOP they might send me some day.
> 
> >  Perhaps if a harder line was taken when characters
> >  are used that cannot be converted, this would make more sense.
> >  (ie give a very clear recognizable indication of corruption or
> >  conversion failures)
> 
> That's reasonable.  Simply replacing unknown characters with '?' doesn't
> work; the character is too easily overlooked.  I would like to see mailers
> replace unsupported characters with a Unicode representation like "[U+A068]".
>  (That would certainly help with this spate of CJK characters that people are
> sending lately on the Unicode list!)  I suspect that's too much Unicode
> awareness to ask of an otherwise Unicode-unaware product, though.
> 
> -Doug Ewell
>  Fullerton, California

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---

Re: Is there Unicode mail out there?

2001-07-13 Thread




　★じゅういっちゃん★

　私はろこえんらかべさ。


>
>On 2001-07-13 at 2:53 EDT, Doug Ewell wrote:
>> Simply replacing unknown characters with '?' doesn't work; the
>> character is too easily overlooked.  I would like to see mailers
>> replace unsupported characters with a Unicode representation
>> like "[U+A068]".
>
>For "ordinary users", i. e., those users who don't have the TUS 3.0
>tome lying next to their computers, a "last resort glyph" would
>probably be more helpful, cf. 
>and .
>
>Best wishes,
>  Otto Stolz
>
>

They can look it up online.

Yes, it is a tome
Not just a book, a TOME.

Re: Is there Unicode mail out there?

2001-07-13 Thread Otto Stolz


Tex Texin wrote:
> Perhaps if a harder line was taken when characters
> are used that cannot be converted, this would make more sense.
> (ie give a very clear recognizable indication of corruption or
> conversion failures)

On 2001-07-13 at 2:53 EDT, Doug Ewell2 wrote:
> That's reasonable.

One problem still is mis-labelled mail. On Wed, 13 Jun 2001,
e. g., I filed a bug report against Eudora 5.1 (the latest
version), quote:
| - Eudora 5.1 sends proprietary Microsoft encoding CP-1252,
|   erroneously labelled as charset="iso-8859-1".
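
A receiving mailer can defend itself against this particular mislabelling. A hedged sketch in Python (the function name and the heuristic are assumptions, not anything Eudora or another mailer is known to do):

```python
def decode_labelled_latin1(raw: bytes) -> str:
    """Decode a body labelled iso-8859-1, tolerating CP-1252 content."""
    # Bytes 0x80-0x9F are C1 controls in ISO 8859-1 and are rarely intended
    # in mail, so their presence suggests the body is really CP-1252.
    if any(0x80 <= b <= 0x9F for b in raw):
        try:
            return raw.decode("cp1252")
        except UnicodeDecodeError:
            pass  # contains a byte that CP-1252 leaves undefined
    return raw.decode("iso-8859-1")
```

This only papers over the problem on the receiving side; the sender's label is still wrong.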

On 2001-07-13 at 2:53 EDT, Doug Ewell wrote:
> Simply replacing unknown characters with '?' doesn't work; the
> character is too easily overlooked.  I would like to see mailers
> replace unsupported characters with a Unicode representation
> like "[U+A068]".

For "ordinary users", i. e., those users who don't have the TUS 3.0
tome lying next to their computers, a "last resort glyph" would
probably be more helpful, cf. 
and .

Best wishes,
  Otto Stolz

Re: Is there Unicode mail out there?

2001-07-13 Thread DougEwell2

In a message dated 2001-07-12 8:55:07 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  So the proposal is that minimizing the charset is a good thing?
>
>  This means that you and I start out in a conversation about a
>  product I am trying to sell you, it happens to be all in ascii
>  and we exchange several mails successfully. Then I quote you
>  a price in Euros and my 1252 message gets corrupted by your
>  reader which can handle either only 8859-1 or ASCII, and
>  you miss the fact that the Euro is corrupted and think we
>  are talking dollars or some other currency.
>
>  Although I understand why you would want a minimal charset in order
>  to not needlessly prevent communications, the implication of
>  reliability and trust that is built by having some success is
>  a problem. You think you are communicating successfully but when it
>  is critical it may not...

The premise seems to be that we should reject, or at least issue a warning 
against, the earlier messages on the basis that the sender *might* be able to 
send characters in the future that the receiver could not receive.  Sorry, 
but I can't buy into that.  That would prevent the CP1252 user from ever 
being able to communicate adequately with anyone who has "only" ISO 8859-1.

What if I am trying to exchange mail with a user of Windows-1256?  Lots of 
roadblocks would be erected because of the chance that the guy *might* send 
me ARABIC LETTER ALEF WITH HAMZA BELOW and I couldn't interpret it.  And I 
couldn't exchange mail with UTF-8 users either, because of that YI SYLLABLE 
BBOP they might send me some day.

>  Perhaps if a harder line was taken when characters
>  are used that cannot be converted, this would make more sense.
>  (ie give a very clear recognizable indication of corruption or
>  conversion failures)

That's reasonable.  Simply replacing unknown characters with '?' doesn't 
work; the character is too easily overlooked.  I would like to see mailers 
replace unsupported characters with a Unicode representation like "[U+A068]". 
 (That would certainly help with this spate of CJK characters that people are 
sending lately on the Unicode list!)  I suspect that's too much Unicode 
awareness to ask of an otherwise Unicode-unaware product, though.
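
For what it is worth, in a Unicode-aware environment the "[U+A068]" style fallback is only a few lines. A sketch using Python's codecs.register_error; the handler name "uplus" and the helper function are made up for illustration.

```python
import codecs


def uplus_replace(exc):
    """Error handler: replace each unencodable character with [U+XXXX]."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(f"[U+{ord(c):04X}]" for c in bad), exc.end


codecs.register_error("uplus", uplus_replace)


def visible_fallback(text: str, charset: str) -> bytes:
    """Encode text, making every conversion failure plainly visible."""
    return text.encode(charset, errors="uplus")
```

The hard part remains the one noted above: a product that does not use Unicode internally has no code point to print.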

-Doug Ewell
 Fullerton, California

Re: Is there Unicode mail out there?

2001-07-12 Thread James Kass

Chris Wendt wrote:

> Replying in the charset of the original message
> is in my view reasonable behavior: the recipient
> of your reply has the best chance to read the
> message in the encoding the original message
> was sent. Changing the encoding decreases the
> chance the replyee will be able to read your
> message.

When a user issues an instruction to a computer, it
is a command rather than a request.  If a user selects
the option to "Use the following default encoding for
outgoing messages:", then the expected behavior is
compliance.

Of course, you are quite right in that the recipient
is more likely to be able to read a message sent in the
recipient's default.  As we move towards a World encoding
standard, perhaps more applications will use the standard
as default.

This message is being sent in Arabic (Windows) because
it is in response to a message sent in that encoding.  The
author of the original message has noted my work-around
and has cleverly prevented it by selecting a code-page
which includes the special character I'm using for the
"kludge".

Best regards,

James Kass.

RE: Is there Unicode mail out there?

2001-07-12 Thread Ayers, Mike



> From: Chris Wendt [mailto:[EMAIL PROTECTED]] 

> Replying in the charset of the original message is in my view 
> reasonable
> behavior: the recipient of your reply has the best chance to read the
> message in the encoding the original message was sent. Changing the
> encoding decreases the chance the replyee will be able to read your
> message.

For person-to-person emails, this makes sense.  It does not hold up
for mailing lists, however - it's not necessarily unreasonable behavior, but
the odds of readability for mailing lists are fixed to the character set,
regardless of the character set used in any individual mailing (note that
the Windows Thai character set could not be viewed by many people - changed
to UTF-8, almost everyone could read it).  For this reason, I would really
like to see option controlled behavior (use the current behavior as a
default).


/|/|ike

RE: Is there Unicode mail out there?

2001-07-12 Thread Chris Wendt

In any case, no matter if new message or reply or forward, you can force
OE to use a specific encoding using the Format.Encoding menu. There is
no option to ALWAYS use a specific encoding in replies and forwards, you
will have to choose manually each time. OE itself has no option to
automatically determine the best outbound encoding (and I agree that
generally the encoding with the smallest repertoire is the best). OE
will only suggest UTF-8 and will not suggest any other charset, if the
chosen encoding does not hold the characters used.

Note: an HTML message to an HTML4 capable recipient will transport any
character regardless of the chosen encoding. That might explain the
different results you are seeing when sending to differently enabled
recipients.

Replying in the charset of the original message is in my view reasonable
behavior: the recipient of your reply has the best chance to read the
message in the encoding the original message was sent. Changing the
encoding decreases the chance the replyee will be able to read your
message.

-Original Message-
From: James Kass [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, July 12, 2001 1:18 PM
To: Jungshik Shin
Cc: Unicode List
Subject: Re: Is there Unicode mail out there?

Jungshik Shin wrote:

>   Perhaps, both Mozilla/Netscape 6 and MS OE should have an option (
> 'toggle-switchable') to let users  specify that their preferred 
> encoding (set in preference) be used by default regardless of the 
> encoding of messages they're replying to.
>

It would be nice...

MS OE appeared to already have the option.  Under Tools-Options-Send,
there's a check-box for "Reply to messages using the format in which
they were sent".  Under Tools-Options-Send-International Settings,
there's a provision for the user to choose a default encoding and a
check-box to "Use the following default encoding for outgoing
messages:".  Even though this system was set up accordingly, outgoing
messages which were replies to messages in non-UTF-8 encodings weren't
being sent in UTF-8, to my surprise, chagrin, and dismay.

Best regards,

James Kass.

Re: Is there Unicode mail out there?

2001-07-12 Thread James Kass

Jungshik Shin wrote:

>   Perhaps, both Mozilla/Netscape 6 and MS OE should have an option (
> 'toggle-switchable') to let users  specify that their preferred encoding
> (set in preference) be used by default regardless of the encoding of
> messages they're replying to.
>

It would be nice...

MS OE appeared to already have the option.  Under Tools-Options-
Send, there's a check-box for "Reply to messages using the format
in which they were sent".  Under Tools-Options-Send-International
Settings, there's a provision for the user to choose a default
encoding and a check-box to "Use the following default encoding for
outgoing messages:".  Even though this system was set up
accordingly, outgoing messages which were replies to messages
in non-UTF-8 encodings weren't being sent in UTF-8, to my
surprise, chagrin, and dismay.

Best regards,

James Kass.

Re: Is there Unicode mail out there?

2001-07-12 Thread Peter_Constable

On 07/12/2001 12:39:30 PM Jungshik Shin wrote:

>   Finally, you succeeded ! Congratulations :-). Could you
>explain what you did differently this time so that other Lotus
>Notes users can benefit from your experience/experiment?

In my first attempts, I had had my Multilingual Internet Mail preference
set to "Use Unicode and prompt". (I had had it like that for some time, and
it never once prompted me, so I don't know what it's supposed to mean.)
This was the setting I had used for the first Thai sample. For the second
Thai sample, I tried changing the preference to "Use Unicode (UTF-8)". That
still didn't force UTF-8. The last time, I simply followed up on a
suggestion sent to me offline by James Kass: add a character that wouldn't
be in another codepage / charset -- I added a ZWSP to my signature. That
did it.
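
The effect can be sketched in Python. This is an illustration of the idea only, not of what Notes or any other mailer actually does internally; the function names are made up and "cp874" is Python's name for the Windows Thai code page.

```python
def fits(text: str, charset: str) -> bool:
    """True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def reply_charset(text: str, original_charset: str) -> str:
    """Keep the original charset when it suffices, else fall back to UTF-8."""
    return original_charset if fits(text, original_charset) else "utf-8"
```

Thai text alone fits the Thai code page, so a mailer following this logic keeps it; one ZERO WIDTH SPACE in the signature is enough to tip the choice to UTF-8.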

- Peter

---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

RE: Is there Unicode mail out there?

2001-07-12 Thread Ayers, Mike



> From: Jungshik Shin [mailto:[EMAIL PROTECTED]] 

>   Mysterious is why this prompting (by MS OE) did not happen to Mike
> Ayers when he replied to Peter's message with Thai string in 
> Windows-874
> adding some Chinese characters while MS OE (5.50.x) I 
> tried certainly
> prompted me to pick one of three (1. send as Unicode, 2. send as is -
> in Windows-874 - risking loss of info. 3. cancel) when I did the same
> thing. ZWS and Chinese characters have no reason to be 
> treated differently
> when added to a Windows-874 encoded message.

Not mysterious really, I'm using Outlook, not Outlook Express.
Despite the similarity of names, the differences seem to be considerable.
It is disturbing, though, that the premium product has less desirable
behavior than the free one in this case.


/|/|ike

Re: Is there Unicode mail out there?

2001-07-12 Thread Jungshik Shin

On Thu, 12 Jul 2001 [EMAIL PROTECTED] wrote:

> >  Hmm, it didn't work either.
> OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว

   Finally, you succeeded ! Congratulations :-). Could you
explain what you did differently this time so that other Lotus
Notes users can benefit from your experience/experiment?

  Jungshik Shin

Re: Is there Unicode mail out there?

2001-07-12 Thread

My other e-mail was a real "moji-baka", I'd say. That would be a good term, 
$BJ8;zGO;
To: [EMAIL PROTECTED];
Cc: 
Date: 01/07/12 15:51
Subject: Re: Is there Unicode mail out there?

>(I didn't read all the thread so maybe I missed a step).
>
>So the proposal is that minimizing the charset is a good thing?
>
>This means that you and I start out in a conversation about a
>product I am trying to sell you, it happens to be all in ascii
>and we exchange several mails successfully. Then I quote you
>a price in Euros and my 1252 message gets corrupted by your
>reader which can handle either only 8859-1 or ASCII, and
>you miss the fact that the Euro is corrupted and think we
>are talking dollars or some other currency.
>
>Although I understand why you would want a minimal charset in order
>to not needlessly prevent communications, the implication of
>reliability and trust that is built by having some success is
>a problem. You think you are communicating successfully but when it
>is critical it may not...
>
>Perhaps if a harder line was taken when characters
>are used that cannot be converted, this would make more sense.
>(ie give a very clear recognizable indication of corruption or
>conversion failures)
>
>tex
>
>
>
>[EMAIL PROTECTED] wrote:
>> 
>> In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
>> [EMAIL PROTECTED] writes:
>> 
>> >  One exception to this should be US-ASCII because not only the repertoire
>> >  of US-ASCII is a subset of the repertoire of UTF-8 but also the
>> >  representation of all characters in US-ASCII is identical in UTF-8.
>> >  A smart mail client would notice that all characters
>> >  are in US-ASCII repertoire  and label outgoing messages as in
>> >  US-ASCII EVEN if it's configured to label outgoing messages
>> >  in UTF-8
>> [...]
>> 
>> I thought this might even be enshrined in an RFC.  It certainly makes sense.
>> If you are using a mailer that sends CP1252 down the wire (not that this is a
>> good idea, but some mailers do this), the mailer should examine the message
>> and if it only contains US-ASCII characters, the message should be tagged as
>> US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as
>> ISO 8859-1.  Only if it actually contains CP1252 characters, like smart
>> quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed,
>> the same goes for UTF-8.
>> 
>> -Doug Ewell
>>  Fullerton, California
>
>-- 
>---
>Tex Texin  Director, International Business
>mailto:[EMAIL PROTECTED]  +1-781-280-4271
>Fax:+1-781-280-4655
>the Progress Company   14 Oak Park, Bedford, MA 01730
>---
>
>

Re: Is there Unicode mail out there?

2001-07-12 Thread Tex Texin

(I didn't read the whole thread, so maybe I missed a step.)

So the proposal is that minimizing the charset is a good thing?

This means that you and I start out in a conversation about a
product I am trying to sell you. It happens to be all in ASCII,
and we exchange several mails successfully. Then I quote you
a price in Euros, and my 1252 message gets corrupted by your
reader, which can handle only 8859-1 or ASCII. You miss the
fact that the Euro sign is corrupted and think we
are talking dollars or some other currency.

Although I understand why you would want a minimal charset in order
not to needlessly prevent communications, the implication of
reliability and trust that is built by having some success is
a problem. You think you are communicating successfully, but when it
is critical you may not be.

Perhaps if a harder line were taken when characters
are used that cannot be converted, this would make more sense
(i.e., give a very clear, recognizable indication of corruption or
conversion failures).
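
As a rough illustration of the "harder line" idea, here is a minimal Python
sketch. The function name convert_or_fail and the sample price string are
made up for this example; the point is only the difference between a lenient
conversion, which silently turns the Euro sign into '?', and a strict one,
which refuses and says why.

```python
def convert_or_fail(text: str, target_charset: str) -> bytes:
    """Encode text for the recipient's charset, refusing to lose characters silently."""
    try:
        return text.encode(target_charset)  # errors="strict" is the default
    except UnicodeEncodeError as exc:
        bad = text[exc.start:exc.end]
        raise ValueError(
            f"cannot represent {bad!r} (U+{ord(bad[0]):04X}) in {target_charset}"
        ) from exc


price = "The price is \u20ac100"  # U+20AC EURO SIGN, present in CP1252

# Lenient route: the Euro sign is quietly replaced by '?'.
lossy = price.encode("iso-8859-1", errors="replace")

# Strict route: the failure is impossible to miss.
try:
    convert_or_fail(price, "iso-8859-1")
    failed = False
    message = ""
except ValueError as err:
    failed = True
    message = str(err)
```

With the strict route the sender learns about the problem before the message
leaves, instead of the recipient guessing which currency was meant.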

tex

[EMAIL PROTECTED] wrote:
> 
> In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
> [EMAIL PROTECTED] writes:
> 
> >  One exception to this should be US-ASCII because not only the repertoire
> >  of US-ASCII is a subset of the repertoire of UTF-8 but also the
> >  representation of all characters in US-ASCII is identical in UTF-8.
> >  A smart mail client would notice that all characters
> >  are in US-ASCII repertoire  and label outgoing messages as in
> >  US-ASCII EVEN if it's configured to label outgoing messages
> >  in UTF-8
> [...]
> 
> I thought this might even be enshrined in an RFC.  It certainly makes sense.
> If you are using a mailer that sends CP1252 down the wire (not that this is a
> good idea, but some mailers do this), the mailer should examine the message
> and if it only contains US-ASCII characters, the message should be tagged as
> US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as
> ISO 8859-1.  Only if it actually contains CP1252 characters, like smart
> quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed,
> the same goes for UTF-8.
> 
> -Doug Ewell
>  Fullerton, California

-- 
---
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271
Fax:+1-781-280-4655
the Progress Company   14 Oak Park, Bedford, MA 01730
---

Re: Is there Unicode mail out there?

2001-07-12 Thread Peter_Constable



>  Hmm, it didn't work either.
OK, one more try -- Thai test, take 3: กลัปมาอยู่แล้ว


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

Re: Is there Unicode mail out there?

2001-07-12 Thread Jungshik Shin

On Thu, 12 Jul 2001, James Kass wrote:

> Here's a work-around that seems to work.
>
> Added the ZWS after the signature in a signature file.
> Because the mojibake for ZWS includes the Euro
> currency symbol, OE prompts to 'send as Unicode'
> when replying to a non-UTF-8 sender.

  What's mysterious is why MS OE did not prompt Mike Ayers this way
when he replied to Peter's message (with its Thai string in Windows-874)
and added some Chinese characters, while the MS OE (5.50.x) I tried certainly
prompted me to pick one of three options (1. send as Unicode, 2. send as is,
in Windows-874, risking loss of info, 3. cancel) when I did the same
thing. There's no reason for ZWS and Chinese characters to be treated
differently when added to a Windows-874 encoded message.

  BTW, Mozilla/Netscape 6 also uses the encoding of the message
(or its closest match among IANA-registered MIME charsets. Thus, in place
of Windows-874, Mozilla/Netscape 6 uses TIS-620) you're replying to by
default. When one adds some characters outside the repertoire of that
encoding, it warns that there are some characters not representable in the
current encoding and it's necessary to change the encoding to something
that can represent all characters. (it does not suggest Unicode.) It
offers two options : go ahead despite potential loss of some characters
or cancel and change the encoding.
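
The kind of check behind such a warning can be sketched in a few lines of
Python. This is not Mozilla's code, just an assumption about the general
approach; the helper name unrepresentable is invented here, and Python's
"tis_620" codec stands in for the TIS-620 charset.

```python
def unrepresentable(text: str, charset: str) -> list[str]:
    """Return the distinct characters of text that charset cannot encode, in order."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Thai from the thread plus the Chinese greeting Mike tried to add.
reply = "กลัปมาอยู่แล้ว 大家好"

problem_chars = unrepresentable(reply, "tis_620")   # the three Chinese characters
no_problem = unrepresentable(reply, "utf-8")        # empty: UTF-8 covers everything
```

A client could show problem_chars to the user and offer to switch encodings
whenever the list is not empty.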

  Perhaps, both Mozilla/Netscape 6 and MS OE should have an option (
'toggle-switchable') to let users  specify that their preferred encoding
(set in preference) be used by default regardless of the encoding of
messages they're replying to.

   Jungshik Shin

Re: Is there Unicode mail out there?

2001-07-12 Thread James Kass


Here's a work-around that seems to work.

Added the ZWS after the signature in a signature file.
Because the mojibake for ZWS includes the Euro
currency symbol, OE prompts to 'send as Unicode'
when replying to a non-UTF-8 sender.
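
For anyone wondering where the Euro sign comes from, a short Python sketch
reproduces the mojibake. It assumes the UTF-8 bytes of the ZWS are being read
as CP1252; the variable names are only for illustration.

```python
ZWS = "\u200b"                          # ZERO WIDTH SPACE
utf8_bytes = ZWS.encode("utf-8")        # three bytes: E2 80 8B
mojibake = utf8_bytes.decode("cp1252")  # read those bytes as Windows-1252

# In CP1252, 0xE2 is 'a' with circumflex, 0x80 is the Euro sign and
# 0x8B is a single left-pointing angle quotation mark.
code_points = [f"U+{ord(ch):04X}" for ch in mojibake]
```

So the three-character junk string contains U+20AC, the Euro sign, in the
middle.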

Of course, the time saved by not having to manually
change the encoding will probably be less than the
time lost explaining what the junk is under my name.

Best regards,

James Kass.
â€‹

Re: Is there Unicode mail out there?

2001-07-12 Thread James

[EMAIL PROTECTED] wrote:
> 
> In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
> [EMAIL PROTECTED] writes:
> 
> >  One exception to this should be US-ASCII because not only the repertoire
> >  of US-ASCII is a subset of the repertoire of UTF-8 but also the
> >  representation of all characters in US-ASCII is identical in UTF-8.
> >  A smart mail client would notice that all characters
> >  are in US-ASCII repertoire  and label outgoing messages as in
> >  US-ASCII EVEN if it's configured to label outgoing messages
> >  in UTF-8
> [...]
> 
> I thought this might even be enshrined in an RFC.  It certainly makes sense.
> If you are using a mailer that sends CP1252 down the wire (not that this is a
> good idea, but some mailers do this), the mailer should examine the message
> and if it only contains US-ASCII characters, the message should be tagged as
> US-ASCII. 

The RFCs/BCPs do encourage using as minimal a charset as possible.

Anyway, UTF-8 email is nowhere right now. Kat Momoi of Netscape has suggested
that about the only way this could change is if email client vendors turn it
on by default in new product releases. I won't be the first!

Having done a lot of email client programming using the RFCs as a basis,
let me say that in general RFCs are vague, and not always the best practice
for interoperability when it comes to email.

For example, CRLF in message bodies is recommended, but actually reduces
interoperability, particularly with subversions of IE 5. So I don't know
of any email client that does it. And quoted-printable is way too
complicated to expect conforming implementations.

And don't get me started about all the random charsets that RFCs promote that
nobody adopts!

James.

Re: Is there Unicode mail out there?

2001-07-11 Thread James Kass


Please disregard my previous message about a work-around
for Outlook Express problem.

Although it works, non-UTF-8 messages are no longer being
properly displayed, an unacceptable trade-off.

Another possibility which was tested was to add an innocuous
character which isn't included in any code page to the
signature.  Tried the zero-width space.  When copying the
zero-width space into the signature of a message being sent
in reply to a message encoded as "Thai (Windows)", Outlook
Express prompted to "Send as Unicode..." when the letter
was tagged to be sent later.  So far, so good.

Figured it would be possible to set up a "signature" with
ZWS to eliminate the necessity of manually changing the 
encoding of messages being sent to UTF-8 every time a 
message is sent.  Unfortunately, on Windows M.E., the 
signature  information is stored in the Registry, and it's ASCII.  
So, the ZWS got converted to a question mark and doesn't
get switched back when it's added to a message.

So, tried setting up a signature file to be added to each
outgoing message including the ZWS.  In this case, MSOE
displays the UTF-8 ZWS as "mojibake" (gibberish) when the
signature is added to the outgoing message.

Perhaps a future version of Outlook will correct the
problem.

Best regards,

James Kass.

Re: Is there Unicode mail out there?

2001-07-11 Thread DougEwell2

In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  One exception to this should be US-ASCII because not only the repertoire
>  of US-ASCII is a subset of the repertoire of UTF-8 but also the
>  representation of all characters in US-ASCII is identical in UTF-8.
>  A smart mail client would notice that all characters
>  are in US-ASCII repertoire  and label outgoing messages as in
>  US-ASCII EVEN if it's configured to label outgoing messages
>  in UTF-8
[...]

I thought this might even be enshrined in an RFC.  It certainly makes sense.  
If you are using a mailer that sends CP1252 down the wire (not that this is a 
good idea, but some mailers do this), the mailer should examine the message 
and if it only contains US-ASCII characters, the message should be tagged as 
US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as 
ISO 8859-1.  Only if it actually contains CP1252 characters, like smart 
quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed, 
the same goes for UTF-8.
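
The tagging rule described above can be sketched in Python as a walk up a
"ladder" of charsets, stopping at the first one that can encode the whole
message. The function name minimal_charset and the particular ladder are
assumptions for this example, not something taken from an RFC.

```python
def minimal_charset(
    text: str,
    ladder=("us-ascii", "iso-8859-1", "windows-1252", "utf-8"),
) -> str:
    """Return the first charset in ladder that can encode every character of text."""
    for name in ladder:
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    raise ValueError("no charset in the ladder can encode this text")


plain = minimal_charset("plain text")                # only ASCII characters
latin = minimal_charset("caf\u00e9")                 # e-acute is in ISO 8859-1
smart = minimal_charset("\u201csmart quotes\u201d")  # curly quotes need CP1252
wide = minimal_charset("\u5927\u5bb6\u597d")         # Chinese needs UTF-8
```

A mailer configured for UTF-8 could use the same idea with a shorter ladder,
for instance ("us-ascii", "utf-8").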

-Doug Ewell
 Fullerton, California

Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Thu, 12 Jul 2001 [EMAIL PROTECTED] wrote:

> In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
> [EMAIL PROTECTED] writes:
>
> >  One exception to this should be US-ASCII because not only the repertoire
> >  of US-ASCII is a subset of the repertoire of UTF-8 but also the
> >  representation of all characters in US-ASCII is identical in UTF-8.
> >  A smart mail client would notice that all characters
> >  are in US-ASCII repertoire  and label outgoing messages as in
> >  US-ASCII EVEN if it's configured to label outgoing messages
> >  in UTF-8

> I thought this might even be enshrined in an RFC.  It certainly makes sense.
> If you are using a mailer that sends CP1252 down the wire (not that this is a
> good idea, but some mailers do this), the mailer should examine the message
> and if it only contains US-ASCII characters, the message should be tagged as
> US-ASCII.  Otherwise, if it only contains ISO 8859-1, it should be tagged as
> ISO 8859-1.  Only if it actually contains CP1252 characters, like smart
> quotes or long dashes, should it be tagged as CP1252.  As Jungshik observed,
> the same goes for UTF-8.

  I can't say it better than you did ! While focusing on
UTF-8, I forgot to mention the case involving Windows-125x, ISO-8859-x
and US-ASCII.

  BTW, some broken/MIME-ignorant mail clients (e.g. Eudora for MS-Windows)
do sorta the opposite. They mislabel outgoing messages as in ISO 8859-1
while they include characters like smart quotes and long dashes. The
best would be to warn users that their messages contain those characters
outside their preferred encoding and to offer a couple of options to
choose from (use Unicode or other wider encodings or 'transliterate'
those characters with those in the repertoire of user's preferred
encoding). Short of that, at least it should label it correctly (not
that I'm in favor of sending out Windows-1252 down the wire.)
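
A rough Python sketch of both ideas follows: spotting a body that is labelled
ISO 8859-1 but carries CP1252-only bytes, and 'transliterating' the usual
offenders into ASCII. The function names and the small fallback table are
invented for this example and are certainly not complete.

```python
def looks_mislabelled_latin1(body: bytes) -> bool:
    """True if a body labelled ISO 8859-1 contains bytes 0x80-0x9F.

    ISO 8859-1 has only C1 control codes in that range, so in ordinary mail
    text such bytes almost always mean the data is really Windows-1252.
    """
    return any(0x80 <= b <= 0x9F for b in body)


# Hypothetical fallback table: curly quotes and dashes to plain ASCII.
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # single curly quotes
    "\u201c": '"', "\u201d": '"',    # double curly quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
})


def transliterate(text: str) -> str:
    """Replace smart quotes and long dashes with ASCII look-alikes."""
    return text.translate(ASCII_FALLBACKS)


body = "\u201cquoted\u201d text".encode("cp1252")
suspicious = looks_mislabelled_latin1(body)
cleaned = transliterate("\u201cquoted\u201d \u2014 text")
```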

   Jungshik Shin

RE: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001, Ayers, Mike wrote:

>   One last time:
>
> > > From: Mark Davis [mailto:[EMAIL PROTECTED]]
> > >
> > > Yes, that works fine. The Thai comes through clearly: 
>กลัปมาอยู่แล้ว
> > >
>
>   Woohoo!!! UTF-8 party!!!  大家好!!!

  Congratulations ^-^ ! This time you clearly made it with both Thai
and Chinese characters intact in UTF-8. Either you manually
changed the encoding to UTF-8 in the composition window (although you were
replying to a message in Windows-874) or you were replying to a message
encoded in UTF-8.

   Jungshik Shin

Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote:

> In a message dated 2001-07-11 15:03:27 Pacific Daylight Time,
> [EMAIL PROTECTED] writes:
>
> >  P.S.How about making a sort of resolution to recommend that anybody
> >  writing to this list  should use UTF-8   *if /when* possible?
> >  This was suggested in the past, but we're still getting
> >  a lot of messages in ISO-8859-1 and other encodings.

  Just in case,  I didn't mean to suggest a 'resolution' to force
everyone to use UTF-8. I just wanted to suggest that a gentle and friendly
recommendation be made as to the encoding to use for this list.

> Believe me, I would if I could.

  Apparently, you're using CompuServe. I'm not sure if it's possible
to use a mail client other than one included in CompuServe 'client/browser/
whatever'.

> MIME-Version: 1.0
> Content-Type: text/plain; charset="US-ASCII"
> Content-Transfer-Encoding: 7bit
> X-Mailer: CompuServe 2000 32-bit sub 113

  If what I heard is correct, it's possible to use an external mail (IMAP4
or POP3) client like Netscape 6/Mozilla and MS OE to access mail folders
in CompuServe. I also heard that unlike AOL (although CompuServe and
AOL are now affiliated) CompuServe has SMTP servers for subscribers to
use for outgoing messages. If all I said is true, I'm wondering why you
don't switch to one of 'external' mail clients I mentioned to compose
your message in UTF-8. Perhaps, what I heard is not the case and that's
why you can't do it. There is still an option, though, namely switching
your ISP :-) (perhaps, that's not a viable option for some reason)

   Jungshik Shin

RE: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001, Ayers, Mike wrote:

> > From: Jungshik Shin [mailto:[EMAIL PROTECTED]]
>
> >   Nothing cryptic. As with others on this thread, your problem is
> > to mistake Windows-874 (legacy encoding for Thai) for UTF-8. Because
> > Windows-874 does NOT cover Chinese characters, they turned into
> > '?'. Judging from your message header, you're not using MS OE
> > but something different.
>
>   I am using OE, set to UTF-8.  If I mail Chinese to myself, all is
> well.

  Sure. It should be in UTF-8 and all Chinese characters
(supported by Unicode and thus UTF-8) should be well and alive because
you're NOT replying to any message with the encoding other than UTF-8,
but you're just composing a message *anew* (as opposed to replying to
someone's mesg in an encoding other than UTF-8)  and you configured MS
OE to use UTF-8 as the default encoding for outgoing messages.

> > > X-Mailer: Internet Mail Service (5.5.2653.19)
> > > Content-Type: text/plain; charset="windows-874"
> > > Content-Transfer-Encoding: 8bit
>
>   Odd.  Perhaps our post office is changing things.

 Or, your version of MS OE (included in beta version of Windows XP)
might use a little bit different 'signature' than used by MS OE 5.x.

> >   No, it should have been Windows-874 party !! :-).
> > Both Mark Davis and Peter Constable sent   messages in Windows-874
> > believing that they're using UTF-8.
>
>   Perhaps, like me, they sent messages in UTF-8 and had them converted
> to Windows-874 without consent.  :-(

  Yes, it can be looked at that way.
As I wrote before (perhaps not so clearly as I wished) and Addison
explained well in what you quoted below, MS Outlook Express uses
the encoding of the message you're replying to (incoming message :
Windows-874) in your reply  (i.e. outgoing message). In this
case, you were replying to Peter's message encoded in and labeled as
Windows-874. Therefore, the encoding of your outgoing message is set
to Windows-874 *regardless of* what you set as the default encoding for
your outgoing message(UTF-8 or whatever).  Consequently when you added
some  Chinese characters, they turned into '?'  as there's no room for
Chinese characters in Windows-874.
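
What happens to the Chinese can be reproduced with a short Python sketch. It
assumes the mailer does a lossy conversion roughly like
encode(..., errors="replace"); Python's "cp874" codec stands in for
Windows-874, and the variable names are only illustrative.

```python
incoming_charset = "cp874"  # charset of the message being replied to

reply_text = "Woohoo!!! UTF-8 party!!!  大家好!!!"

# The reply is forced into the incoming charset; anything outside its
# repertoire is replaced by '?'.
sent = reply_text.encode(incoming_charset, errors="replace")
received = sent.decode(incoming_charset)

# Thai, by contrast, survives the round trip because Windows-874 covers it.
thai = "กลัปมาอยู่แล้ว"
thai_round_trip = thai.encode(incoming_charset).decode(incoming_charset)
```

Each of the three Chinese characters comes back as a single '?', which is
exactly the "???!!!" seen earlier in the thread, while the Thai is untouched.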

> >However, I'm sending this in UTF-8 (after automatic conversion by
> > my mail client, Pine 4.33).

>   I also received it as UTF-8.

  Pine doesn't do any trick, but just follows what I told it to :-).

> 
> I think you'll find that Peter's response applies to you too: the mailer is
> seeing Windows-874 on the incoming message and converting your outgoing
> message to use that same encoding (in a bid to be compatible with the
> original message). Outlook has done that for awhile. If you manually set the
> encoding for the reply you can override that behavior. In Outlook 2000 this
> is "Format | Encoding"
> 

  That's also the case in MS OE 5.x. You can override this behavior
(using the same encoding for your reply as used by the message you're
replying to, regardless of your configured default encoding) in 'Format |
Encoding' in the message composition window.  In addition, when you go
to 'Format | Encoding' in the msg composition window, you can see what
encoding is being used for the message you're composing now. The bullet
point is to the left of the encoding used.

  BTW, what's strange is that you didn't get prompted when you
added some Chinese characters to your reply to Peter's message with Thai
characters encoded in Windows-874. My version of MS OE (5.50.4522.1200)
asks me to pick one of the following three options when I do that:

  - Send As Unicode: The msg will be sent as a Unicode message.
  All char. set info. will be retained... some mail client
  may not be able to deal with Unicode msg

  - Send As Is : The msg. will be sent as a regular email msg using
 only the default char. set. Any text not in
 the default char. set may be unreadable by the
 recipient.

(Note, in this case,
 that it's not what you set as the default for
 your outgoing message but the encoding of
 the message you're replying to : Windows-874)

  - Cancel: return to.

  If you (and Mark who uses a little bit old version
of MS OE : Microsoft Outlook Express 5.00.3018.1300) had been prompted
this way, you would have picked the first option to avoid sending your
messages in Windows-874 (without your knowledge).

   Jungshik Shin

Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote:

> >  Now the question is whether it's possible to force Lotus Notes
> >to use UTF-8 as  the encoding of the outgoing message  EVEN WHEN
> >characters in the message are all covered by   existing
> >encoding other than UTF-8 (e.g. Windows-874 for Thai).
>
> Well, I'm going to try one more thing -- Thai test, take 2: 
>กลัปมาอยู่แล้ว

  Hmm, it didn't work either. Even though you're replying to my
message in UTF-8 (and clearly labeled as such in Content-Type header)
with some Ethiopian characters (which were removed in your reply), Lotus
Notes silently (without your consent) fell back to Windows-874 when you
added some Thai characters in your reply. You may have to do some more
digging to find an option/switch buried deep inside  to make Lotus Notes
use UTF-8 no matter what (or when you want) (instead of using
the 'smallest??' encoding that covers all characters in your
outgoing messages). It seems like Lotus Notes is too 'smart'..

   Jungshik Shin

Re: Is there Unicode mail out there?

2001-07-11 Thread James Kass

Mike Ayers wrote:

>
> Okay, I sent these as UTF-8, with some Chinese 
> where the question marks are.  However, the 
> Chinese is getting eaten somewhere along the way.
> Oddly, though, the Thai still displays fine.  Would 
> any Outlook XP guru volunteer to help me get back 
> to my international ways?
>
> Final test:  

On Outlook Express 5
[Tools] - [Options] - [Read] - [Fonts] -
(Unicode) - {Select appropriate fonts} - {Set as Default}

- then -
[Tools] - [Options] - [Read] - [International Settings] -
{Check the box marked 'Use default encoding for all...'}

This seems to work around the distressing practice of the
program automatically replying to senders in the sender's
default rather than the user's preference.

Possibly there are other settings under the [Send] and/or
[Compose] tabs that might also have to be adjusted.  On this
system, the 'reply to senders using the senders format' field
was unchecked, yet my replies to earlier messages in the thread
were being sent as "Thai (Windows)".

Best regards,

James Kass.

Re: Is there Unicode mail out there?

2001-07-11 Thread DougEwell2


In a message dated 2001-07-11 15:03:27 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  P.S.How about making a sort of resolution to recommend that anybody
>  writing to this list  should use UTF-8   *if /when* possible?
>  This was suggested in the past, but we're still getting
>  a lot of messages in ISO-8859-1 and other encodings.

Believe me, I would if I could.

-Doug Ewell
 Fullerton, California

RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike



One last time:

> > From: Mark Davis [mailto:[EMAIL PROTECTED]] 
> > 
> > Yes, that works fine. The Thai comes through clearly: 
>กลัปมาอยู่แล้ว
> > 

Woohoo!!! UTF-8 party!!!  大家好!!!


/|/|ike

RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike



> From: Jungshik Shin [mailto:[EMAIL PROTECTED]] 

>   Nothing cryptic. As with others on this thread, your problem is
> to mistake Windows-874 (legacy encoding for Thai) for UTF-8. Because
> Windows-874 does NOT cover Chinese characters, they turned into
> '?'. Judging from your message header, you're not using MS OE
> but something different.

I am using OE, set to UTF-8.  If I mail Chinese to myself, all is
well.

> > X-Mailer: Internet Mail Service (5.5.2653.19)
> > Content-Type: text/plain; charset="windows-874"
> > Content-Transfer-Encoding: 8bit

Odd.  Perhaps our post office is changing things.

>   No, it should have been Windows-874 party !! :-).
> Both Mark Davis and Peter Constable sent   messages in Windows-874
> believing that they're using UTF-8.

Perhaps, like me, they sent messages in UTF-8 and had them converted
to Windows-874 without consent.  :-(

>However, I'm sending this in UTF-8 (after automatic conversion by
> my mail client, Pine 4.33).

I also received it as UTF-8.


I think you'll find that Peter's response applies to you too: the mailer is
seeing Windows-874 on the incoming message and converting your outgoing
message to use that same encoding (in a bid to be compatible with the
original message). Outlook has done that for awhile. If you manually set the
encoding for the reply you can override that behavior. In Outlook 2000 this
is "Format | Encoding"


Mine already says UTF-8.  Test again: 你好吗？


/|/|ike

Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001 [EMAIL PROTECTED] wrote:

> >> Unicode (UTF-8)". Just as a test, here's a bit of Thai: ¡Ƒ»΂ڨ↩ō
> >  Your mail has the following header, which indicates that
> >it's in 'Windows-874' encoding. I'm not sure whether that encoding name
> >is registered with IANA for use in MIME.
> >
> >> X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
> >> MIME-Version: 1.0
> >> Content-type: text/plain; charset=Windows-874
>
> OK, I didn't look closely at the header, just at the result. Here's another
> test that will be telling - I don't know of any codepage / charset for
> Ethiopic: ሀሁሂሃ

   Yes, this time you made it :-)

> X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
> MIME-Version: 1.0
> Content-type: text/plain; charset=UTF-8

  Now the question is whether it's possible to force Lotus Notes
to use UTF-8 as  the encoding of the outgoing message  EVEN WHEN
characters in the message are all covered by   existing
encoding other than UTF-8 (e.g. Windows-874 for Thai).

 One exception to this should be US-ASCII: not only is the repertoire
of US-ASCII a subset of the repertoire of UTF-8, but the representation
of every US-ASCII character is also identical in UTF-8.
A smart mail client would notice that all characters
are in the US-ASCII repertoire and label outgoing messages as
US-ASCII EVEN if it's configured to label outgoing messages
as UTF-8 (or any other superset of US-ASCII like EUC-KR, ISO-2022-JP,
GB2312-80, or ISO8859-[1-9,15]; a better term for GB2312-80 is certainly
EUC-CN, but it's not registered with IANA and GB2312-80 has spread too
widely to be remedied).  There's no violation of standards
in NOT doing this, but doing it would certainly reduce
the chance of an unnecessary 'red flag' being raised by some mail clients on
the recipient's side. Unfortunately, MS OE and Netscape Mail
are not smart in this regard, while Pine and Mutt are.
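
The byte-for-byte identity is easy to check with a small Python sketch. The
codec names below are Python's spellings, the sample text is made up, and
label_for is a hypothetical helper showing the labelling rule, not what Pine
or Mutt actually run.

```python
ASCII_SUPERSETS = ("utf-8", "euc_kr", "gb2312", "iso-8859-1", "iso-8859-15")

sample = "Hello, Unicode list!"
ascii_bytes = sample.encode("ascii")

# Every listed charset produces exactly the same bytes for pure ASCII text.
same_bytes = all(sample.encode(cs) == ascii_bytes for cs in ASCII_SUPERSETS)


def label_for(text: str, configured: str = "utf-8") -> str:
    """Label pure-ASCII text as us-ascii; otherwise use the configured charset."""
    return "us-ascii" if text.isascii() else configured


ascii_label = label_for(sample)          # downgraded to us-ascii
other_label = label_for("caf\u00e9")     # stays with the configured charset
```

Since the bytes on the wire are identical either way, relabelling costs
nothing and only changes what the recipient's client believes it must
support.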

  Jungshik Shin

P.S.How about making a sort of resolution to recommend that anybody
writing to this list  should use UTF-8   *if /when* possible?
This was suggested in the past, but we're still getting
a lot of messages in ISO-8859-1 and other encodings.

Re: Is there Unicode mail out there?

2001-07-11 Thread Peter_Constable



>  Now the question is whether it's possible to force Lotus Notes
>to use UTF-8 as  the encoding of the outgoing message  EVEN WHEN
>characters in the message are all covered by   existing
>encoding other than UTF-8 (e.g. Windows-874 for Thai).

Well, I'm going to try one more thing -- Thai test, take 2: กลัปมาอยู่แล้ว


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

RE: Is there Unicode mail out there?

2001-07-11 Thread Addison Phillips [wM]


I think you'll find that Peter's response applies to you too: the mailer is seeing
Windows-874 on the incoming message and converting your outgoing message to use that
same encoding (in a bid to be compatible with the original message). Outlook has done
that for awhile. If you manually set the encoding for the reply you can override that
behavior. In Outlook 2000 this is "Format | Encoding"

Best Regards,

Addison

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Ayers, Mike
> Sent: Wednesday, July 11, 2001 12:42 PM
> To: Unicode List
> Subject: RE: Is there Unicode mail out there?
> 
> 
> 
>   Okay, I sent these as UTF-8, with some Chinese where 
> the question
> marks are.  However, the Chinese is getting eaten somewhere 
> along the way.
> Oddly, though, the Thai still displays fine.  Would any 
> Outlook XP guru
> volunteer to help me get back to my international ways?
> 
>   Final test:  
> 
> 
> > From: Ayers, Mike [mailto:[EMAIL PROTECTED]] 
> > 
> > Let's try this again...
> > 
> > > > From: Mark Davis [mailto:[EMAIL PROTECTED]] 
> > > > 
> > > > Yes, that works fine. The Thai comes through clearly: 
> > กลัปมาอยู่แล้ว
> > > > 
> > 
> > Woohoo!!!  UTF-8 party!!!  ???!!!
> > 
> > > 
> > > /|/|ike
> > > 
> > 
> 
>

Re: Is there Unicode mail out there?

2001-07-11 Thread DougEwell2

In a message dated 2001-07-11 13:31:54 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  OK, I didn't look closely at the header, just at the result. Here's another
>  test that will be telling - I don't know of any codepage / charset for
>  Ethiopic: áˆ€áˆáˆ‚áˆƒ

Everything came out fine.  Of course, what I saw was the raw bytes, 
interpreted as CP1252, but I just cut and pasted them into SC UniPad and 
everything came out fine (except for the fact that UniPad doesn't have 
Ethiopic glyphs yet...).

The header revealed the encoding Peter used:

>  Content-type: text/plain; charset=UTF-8
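
Checking the declared charset does not have to be done by eye. Here is a
minimal Python sketch using the standard email package; the raw message is
assembled from header lines quoted in this thread, and the body text is a
placeholder.

```python
from email import message_from_string

raw = (
    "X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000\n"
    "MIME-Version: 1.0\n"
    "Content-type: text/plain; charset=Windows-874\n"
    "\n"
    "placeholder body\n"
)

msg = message_from_string(raw)

content_type = msg.get_content_type()   # media type without parameters
charset = msg.get_content_charset()     # charset parameter, lowercased
```

Looking at the charset parameter rather than at how the text happens to
display is what settles whether a message really went out as UTF-8.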

-Doug Ewell
 Fullerton, California

RE: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001, Ayers, Mike wrote:

>   Okay, I sent these as UTF-8, with some Chinese where the question
> marks are.  However, the Chinese is getting eaten somewhere along the way.
> Oddly, though, the Thai still displays fine.  Would any Outlook XP guru
> volunteer to help me get back to my international ways?
>
>   Final test:  

  Nothing cryptic. As with others on this thread, your problem is
to mistake Windows-874 (legacy encoding for Thai) for UTF-8. Because
Windows-874 does NOT cover Chinese characters, they turned into
'?'. Judging from your message header, you're not using MS OE
but something different.

> X-Mailer: Internet Mail Service (5.5.2653.19)
> Content-Type: text/plain; charset="windows-874"
> Content-Transfer-Encoding: 8bit

   MS OE 5.x is smart enough to detect characters in your reply (in this
case, Chinese characters) that are not covered by the repertoire of the
MIME charset of the message you're replying to (in this case, Windows-874),
which by default is also the MIME charset of your reply. It then prompts
you to choose whether to use UTF-8, explaining that some characters
are not representable in the default encoding (the encoding of the
message you're replying to) and will be lost.
You can also configure MS OE  to always use UTF-8 (or whatever
encoding you choose) regardless of the encoding of
messages you're replying to.

> > From: Ayers, Mike [mailto:[EMAIL PROTECTED]]
> >
> > Let's try this again...
> >
> > > > From: Mark Davis [mailto:[EMAIL PROTECTED]]
> > > >
> > > > Yes, that works fine. The Thai comes through clearly:
> > กลัปมาอยู่แล้ว
> > > >
> >
> > Woohoo!!!  UTF-8 party!!!  ???!!!

  No, it should have been Windows-874 party !! :-).
Both Mark Davis and Peter Constable sent   messages in Windows-874
believing that they're using UTF-8.

   However, I'm sending this in UTF-8 (after automatic conversion by
my mail client, Pine 4.33).

Jungshik Shin

Re: Is there Unicode mail out there?

2001-07-11 Thread Peter_Constable



>> Unicode (UTF-8)". Just as a test, here's a bit of Thai: ¡Ƒ»΂ڨ↩ō
>  Your mail has the following header, which indicates that
>it's in 'Windows-874' encoding. I'm not sure whether that encoding name
>is registered with IANA for use in MIME.
>
>> X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
>> MIME-Version: 1.0
>> Content-type: text/plain; charset=Windows-874

OK, I didn't look closely at the header, just at the result. Here's another
test that will be telling - I don't know of any codepage / charset for
Ethiopic: ሀሁሂሃ



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike



Let's try this again...

> > From: Mark Davis [mailto:[EMAIL PROTECTED]] 
> > 
> > Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ
> > 

Woohoo!!!  UTF-8 party!!!  ???!!!

> 
> /|/|ike
>

RE: Is there Unicode mail out there?

2001-07-11 Thread Ayers, Mike



Okay, I sent these as UTF-8, with some Chinese where the question
marks are.  However, the Chinese is getting eaten somewhere along the way.
Oddly, though, the Thai still displays fine.  Would any Outlook XP guru
volunteer to help me get back to my international ways?

Final test:  


> From: Ayers, Mike [mailto:[EMAIL PROTECTED]] 
> 
>   Let's try this again...
> 
> > > From: Mark Davis [mailto:[EMAIL PROTECTED]] 
> > > 
> > > Yes, that works fine. The Thai comes through clearly: 
> ¡ÅÑ»ÁÒÍÂÙèáÅéÇ
> > > 
> 
>   Woohoo!!!  UTF-8 party!!!  ???!!!
> 
> > 
> > /|/|ike
> > 
>

Re: Is there Unicode mail out there?

2001-07-11 Thread Jungshik Shin

On Wed, 11 Jul 2001, Mark Davis wrote:

> - Original Message -
> From: <[EMAIL PROTECTED]>
> Sent: Wednesday, July 11, 2001 09:33

> > Main and News tab, in the Multilingual Internet Mail drop down, select
> "Use
> > Unicode (UTF-8)". Just as a test, here's a bit of Thai: 
>à¸à¸¥à¸±à¸à¸¡à¸²à¸à¸¢à¸¹à¹à¹à¸¥à¹à¸§

> Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ

  Well, it was not in UTF-8, though. It was encoded in Windows-874
(for Thai) and was flagged as such in Content-Type header of the message.

   Content-Type: text/plain; charset=Windows-874

  In my previous response, I thought actual encoding used was UTF-8, but
Lotus Notes  put the incorrect charset parameter value in C-T header. That
turned out not to be the case. At least, there's NO inconsistency between
what's used in the message body and what the message header indicated
was used in the message body.

  I'm writing this email with Pine running inside
UTF-8 enabled xterm with the following line added to display filter
spec. of my pinerc (Pine configuration file)

  _CHARSET(Windows-874)_ /usr/bin/iconv -f CP874 -t UTF-8
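
For readers without iconv at hand, roughly the same conversion can be
sketched in Python. This assumes Python's "cp874" codec matches what iconv
calls CP874; the function name to_utf8 is invented, and the byte string is
the Thai test phrase from this thread as it travels in Windows-874.

```python
def to_utf8(raw: bytes, source_charset: str = "cp874") -> bytes:
    """Decode raw bytes from source_charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# The Thai test phrase as Windows-874 bytes, one byte per character.
thai_874 = b"\xa1\xc5\xd1\xbb\xc1\xd2\xcd\xc2\xd9\xe8\xe1\xc5\xe9\xc7"

as_utf8 = to_utf8(thai_874)            # what the display filter would emit
thai_text = as_utf8.decode("utf-8")    # the readable Thai string

# The same bytes misread as Windows-1252 give the accented-Latin gibberish
# that shows up in some copies of these messages.
misread = thai_874.decode("cp1252")
```

The misread string is the familiar run of accented capitals, which is a
quick visual hint that a message body is really Windows-874 rather than
UTF-8.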

   Unlike my previous message (which include Windows-874 encoded
string in Thai but marked as in UTF-8 because I thought that Thai string
was in UTF-8), this message should have Thai string encoded in UTF-8
(as indicated by C-T header).

   Jungshik Shin

Re: Is there Unicode mail out there?

2001-07-11 Thread Peter_Constable



>Yes, that works fine. The Thai comes through clearly: ¡ÅÑ»ÁÒÍÂÙèáÅéÇ

And my own message came back to me with the Thai as I originally sent it.
So, I'm getting UTF-8 going out and coming in with nothing messing it up in
between. If other Notes users aren't getting the same results, check the
version of your client (I don't know if R4.x could handle Unicode or not),
and check your preferences.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
