Re: UTF-16 is not Unicode

2002-02-13 Thread Lars Marius Garshol


* Michael Everson
| 
| I think it's clear that Unicode should give some advice as to how to
| announce encoding options in a useful way to the end user. For the
| two encodings we are discussing, may I suggest the following
| standard menu items:
| 
| Unicode (Raw, UTF-16)
| Unicode (Web, UTF-8)

I don't think calling it "raw" is very good. It just keeps alive the
myth that UTF-16 *is* Unicode. None of the UTFs are "raw", and the
closest must surely be UTF-32.

The below would probably be easier:

  Unicode (UTF-16)
  Unicode (UTF-8, default)

If you know what you're doing you can choose what you want. If not,
you should just choose UTF-8.
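
To make the point that none of the UTFs is "raw" concrete, here is a minimal Python sketch. It is not from the thread: the sample string and the function name encoded_forms are illustrative assumptions. It serializes the same three code points in each encoding form. UTF-32 comes closest to the bare code point values, but it is still a byte serialization with a byte order.

```python
# One Unicode string, three encoding forms. Each is a different byte
# serialization of the same code points; none of them is "raw" Unicode.
text = "A\u20ac\U0001d11e"  # 'A', EURO SIGN, MUSICAL SYMBOL G CLEF


def encoded_forms(s):
    """Return the bytes of s in each UTF (big-endian, no byte order mark)."""
    return {
        "UTF-8": s.encode("utf-8"),
        "UTF-16": s.encode("utf-16-be"),
        "UTF-32": s.encode("utf-32-be"),
    }


forms = encoded_forms(text)
for name, data in forms.items():
    # Print the byte count and a hex dump for each encoding form.
    print(f"{name:7} {len(data):2} bytes  {data.hex(' ')}")
```

The G clef lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair for it, which is one reason UTF-16 cannot simply be equated with "Unicode".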

-- 
Lars Marius Garshol, Ontopian         http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC        http://www.garshol.priv.no





RE: UTF-16 is not Unicode

2002-02-13 Thread Marco Cimarosti

David Starner wrote:
> On Tue, Feb 12, 2002 at 08:12:08PM +0100, Marco Cimarosti wrote:
> > OK, UTF-8 is my favorite default UTF too. However, whatever the default is,
> > it is easier to just call it "Unicode", and call the other options "Unicode
> > (something else)".
> > 
> > That puts one less acronym in front of the "naive" user. The expert user is
> > supposed to know what the default UTF is on her platform.
> 
> What happens when a user is told to save in UTF-16? What 
> about when two users running different operating systems
> try to pass files about? And why would Unicode be any
> clearer to a naive user than UTF-16?

I only have a definite answer for the last question: everybody I know who
works with computers knows what Unicode is, but outside this mailing list I
have never met anyone who was familiar with the acronym "UTF-16"...

By the way, because of its transparent etymology, the name "Unicode" is also
quite self-explanatory. On the other hand, "UTF-16" is just one more
3-letter acronym for one more 16-bit technology, so it could be mistaken
for all kinds of computer-related products.

> IMO, UTF-16 is as clear as Unicode, and more accurate. Being 
> consistent among platforms is a needed plus.

Perhaps "more accurate", but definitely not "as clear as".

_ Marco




Re: UTF-16 is not Unicode

2002-02-13 Thread Michael Everson

At 14:28 -0600 2002-02-12, David Starner wrote:
>
>What happens when a user is told to save in UTF-16? What about when two
>users running different operating systems try to pass files about? And
>why would Unicode be any clearer to a naive user than UTF-16?
>
>IMO, UTF-16 is as clear as Unicode, and more accurate. Being consistent
>among platforms is a needed plus.

Internet Explorer calls Unicode (UTF-8) "Universal Alphabet". Now I 
would say pretty much the same thing to the layman, but the 
distinction between what's on the web (UTF-8) and what might be coded 
elsewhere (UTF-16) should be made.

Apple's TextEdit on OS X gives a set of choices for encodings to 
the user in the Open File dialogue:

Western (Mac Roman)
Western (Windows Latin 1)
Japanese (Mac OS)
Japanese (Shift JIS)
Traditional Chinese (Mac OS)
Simplified Chinese (Mac OS)
Korean (Mac OS)
Unicode
UTF-8

Apple's OS 9 WorldText can save as UTF-16 but calls it "Standard Unicode".

Cyclone's Unicode options are four:

Standard (16 bit)
Standard (16 bit) Canonical Decomposition
UTF-7
UTF-8

These distinctions are interesting. They show that many users are 
expected to know, or find out, about specific encoding differences.

I think it's clear that Unicode should give some advice as to how to 
announce encoding options in a useful way to the end user. For the 
two encodings we are discussing, may I suggest the following standard 
menu items:

Unicode (Raw, UTF-16)
Unicode (Web, UTF-8)
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: UTF-16 is not Unicode

2002-02-13 Thread Lars Marius Garshol


* Marco Cimarosti
| 
| Only if the user selects a menu like "Manual encoding settings", she
| should be presented with a choice like "International (Unicode)",
| as opposed to "Western (ISO 8859-1)", "Chinese, simplified (GB
| 2312-80)", and so on. All entries should have a generic descriptive
| label together with a precise geek-friendly label in parentheses.

This is what Mozilla does, and I seem to recall that IE does the
same. (I can't check here, being Linux-only at home.)

Opera has an encoding menu divided into Unicode / Western European /
Central European / ... / Chinese / Japanese / Korean, where the
choices on the next level are UTF-8, UTF-16 / ISO 8859-1, ISO 8859-15,
Windows-1252 / ...

Seems to work pretty well.

-- 
Lars Marius Garshol, Ontopian         http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC        http://www.garshol.priv.no





RE: UTF-16 is not Unicode

2002-02-12 Thread Yves Arrouye

> An ideal interface should probably automatically and silently select Unicode
> (and its default UTF) whenever one or more of the characters in a document
> are not representable in the local encoding.

I beg to differ. Silently making such an unexpected change is guaranteed to
confuse the user, especially as she starts exchanging the files or loading
them in other programs. The interface should warn the user and offer a couple
of sensible choices, one of them (and maybe the default) being to save using
one of the UTFs.
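
A minimal Python sketch of the "warn and offer choices" idea follows. It is only an illustration under my own assumptions: the function names unencodable and save_choices, the order of the offered choices, and the "(lossy)" label are all hypothetical, not anything a real editor is known to do.

```python
def unencodable(text, encoding):
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def save_choices(text, local_encoding):
    """Return the encodings to offer; a one-item list means no prompt is needed."""
    if not unencodable(text, local_encoding):
        return [local_encoding]
    # Hypothetical policy: offer a UTF first (the suggested default), then a
    # lossy save in the local encoding as an explicit, visible choice.
    return ["utf-8", "utf-16", local_encoding + " (lossy)"]


print(unencodable("price: 5 \u20ac", "latin-1"))
print(save_choices("price: 5 \u20ac", "latin-1"))
```

The point of returning the offending characters is that the warning can name them, so the user can decide whether losing them matters.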

YA





Re: UTF-16 is not Unicode

2002-02-12 Thread David Starner

On Tue, Feb 12, 2002 at 08:12:08PM +0100, Marco Cimarosti wrote:
> OK, UTF-8 is my favorite default UTF too. However, whatever the default is,
> it is easier to just call it "Unicode", and call the other options "Unicode
> (something else)".
> 
> That puts one less acronym in front of the "naive" user. The expert user is
> supposed to know what the default UTF is on her platform.

What happens when a user is told to save in UTF-16? What about when two
users running different operating systems try to pass files about? And
why would Unicode be any clearer to a naive user than UTF-16?

IMO, UTF-16 is as clear as Unicode, and more accurate. Being consistent
among platforms is a needed plus.

-- 
David Starner / Давид Старнзр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, "Peace and Love, Inc."




RE: UTF-16 is not Unicode

2002-02-12 Thread Marco Cimarosti

David Starner wrote:
> On Tue, Feb 12, 2002 at 11:22:01AM +0100, Marco Cimarosti wrote:
> > At best, the localization could use a label such as "Unicode (UTF-8)" to
> > enforce the concept that UTF-8 is Unicode as well. But it could hardly use
> > "Unicode (UTF-16BE)" for the *default* UTF, because the user would ask
> > "Where is *plain* 'Unicode'?"
> 
> Then possibly the user should be educated. This gets hairier if you think
> about crossplatform; UTF-16 is not suitable for native use on Unix
> systems (the whole reason UTF-8 was created), so the user should be
> encouraged to use UTF-8 for the default Unicode encoding on those
> platforms.

OK, UTF-8 is my favorite default UTF too. However, whatever the default is,
it is easier to just call it "Unicode", and call the other options "Unicode
(something else)".

That puts one less acronym in front of the "naive" user. The expert user is
supposed to know what the default UTF is on her platform.

> > An ideal interface should probably automatically and silently select Unicode
> > (and its default UTF) whenever one or more of the characters in a document
> > are not representable in the local encoding.
> 
> NO! Don't ever save in an encoding that isn't the local encoding
> without user interaction. 

OK, I withdraw the word "silently"! However, even a simple prompt like "One
or more characters cannot be saved in the current locale. Do you want to
save in Unicode?" *is* interaction, and it tells the user everything she
needs to know to decide.

_ Marco




Re: UTF-16 is not Unicode

2002-02-12 Thread David Starner

On Tue, Feb 12, 2002 at 11:22:01AM +0100, Marco Cimarosti wrote:
> At best, the localization could use a label such as "Unicode (UTF-8)" to
> enforce the concept that UTF-8 is Unicode as well. But it could hardly use
> "Unicode (UTF-16BE)" for the *default* UTF, because the user would ask
> "Where is *plain* 'Unicode'?"

Then possibly the user should be educated. This gets hairier if you think
about crossplatform; UTF-16 is not suitable for native use on Unix
systems (the whole reason UTF-8 was created), so the user should be
encouraged to use UTF-8 for the default Unicode encoding on those
platforms.
 
> An ideal interface should probably automatically and silently select Unicode
> (and its default UTF) whenever one or more of the characters in a document
> are not representable in the local encoding.

NO! Don't ever save in an encoding that isn't the local encoding
without user interaction. You can very quickly get mojibake with that -
the next program will probably open the file in the local encoding. It
also tends to annoy the user when the program picks the encoding,
ignoring what everything else on the system uses. Offer to change the
encoding or transliterate.
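
The mojibake scenario can be reproduced in a few lines of Python. This is a sketch for illustration only: the function name reopen_in_local is made up, and ISO 8859-1 stands in for "the local encoding" purely as an assumption.

```python
def reopen_in_local(text, saved_as="utf-8", local="latin-1"):
    """Simulate one program saving in `saved_as` and the next one
    opening the same bytes in the `local` encoding."""
    data = text.encode(saved_as)
    return data.decode(local)


text = "\u0414\u0430\u0432\u0438\u0434"  # five Cyrillic letters
seen = reopen_in_local(text)
print(repr(seen))  # two junk characters per original letter

# The damage can be undone only by someone who guesses both encodings.
recovered = seen.encode("latin-1").decode("utf-8")
print(recovered == text)
```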
 
> When the user selects "International (Unicode)", he should be allowed to
> enter an "Advanced settings" menu which, for this encoding, allows choosing
> between "8 bit ASCII-compatible (UTF-8)", "16 bit with surrogates support
> (UTF-16)", "flat 32 bit (UTF-32)". Selecting "16 bit ... (UTF-16)" shows
> extra choices like "Big-endian" vs. "Little Endian".

That's too painful, unless you're doing something like Netscape that has
to provide a million encoding options. UTF-32 is worthless as a disk
encoding. Handle both endians of UTF-16 correctly, and assume everyone
else does - if they don't, then a converter program can be found and
used. That reduces your choices to two: UTF-8 and UTF-16. If you must
complicate things, SCSU is a more useful complication.
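
Handling both byte orders of UTF-16 on input is not much code. The sketch below is my own illustration, not any particular program's method: the function name decode_utf16 and the big-endian fallback for BOM-less data are assumptions. Python's built-in "utf-16" codec already honours a byte order mark; the explicit version just shows the logic.

```python
import codecs


def decode_utf16(data, default="utf-16-be"):
    """Decode UTF-16 bytes of either byte order.

    A leading byte order mark (BOM) decides the order; without one,
    fall back to `default` (an assumption of this sketch).
    Note: UTF-32 input is not detected here, although a little-endian
    UTF-32 BOM begins with the same two bytes as the UTF-16LE BOM.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return data.decode(default)


little = codecs.BOM_UTF16_LE + "Unicode".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "Unicode".encode("utf-16-be")
print(decode_utf16(little) == decode_utf16(big))
```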
 
-- 
David Starner / Давид Старнзр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, "Peace and Love, Inc."




Re: UTF-16 is not Unicode

2002-02-12 Thread Tom Gewecke

>As a poor software maker, I suppose I ought to defend other software
>makers. EVERYONE KNOWS that Unicode and UTF-16 are the same thing. It is,
>unfortunately, irrelevant that in this case (as in so many others) "what
>everyone knows" happens to be untrue. We exist to conform to the user's
>expectations, not to educate him; still less to confuse him by replacing
>a nice simple word (Unicode) with indigestible code letters and digits
>(UTF-16BE or whatever).

>That said, has anyone a suggestion for names of available output formats
>(as presented to an end user) that would not confuse the user but would
>satisfy the purist?

Just about anything would seem to be an improvement over the PR which
Unicode now gets every time a user selects "Unicode" (when it means UTF-16)
on his browser encoding menu and gets total gibberish on his screen.






RE: UTF-16 is not Unicode

2002-02-12 Thread Marco Cimarosti

Martin Kochanski wrote:
> > From: Tom Gewecke <[EMAIL PROTECTED]>
[...]
> > I constantly run into browser, mail, and text editing software
> > with encoding menus that list, as two separate items, Unicode
> > and UTF-8, as if Unicode and UTF-16 were identical and as if
> >  UTF-8 were not Unicode.
>
> As a poor software maker, I suppose I ought to defend other 
> software makers. EVERYONE KNOWS that Unicode and UTF-16 are 
> the same thing. It is, unfortunately, irrelevant that in this 
> case (as in so many others) "what everyone knows" happens to 
> be untrue. We exist to conform to the user's expectations, 
> not to educate him; still less to confuse him by replacing a 
> nice simple word (Unicode) with indigestible code letters and 
> digits (UTF-16BE or whatever).

The terms that actually get into common usage follow unpredictable routes.
I guess that if users commonly say "Save the file as Unicode" as opposed to
"Save it in UTF-8", the labels on the menu should reflect this language, or
the user would not know what to do.

At best, the localization could use a label such as "Unicode (UTF-8)" to
enforce the concept that UTF-8 is Unicode as well. But it could hardly use
"Unicode (UTF-16BE)" for the *default* UTF, because the user would ask
"Where is *plain* 'Unicode'?"

> That said, has anyone a suggestion for names of available 
> output formats (as presented to an end user) that would not 
> confuse the user but would satisfy the purist?

Before trying to answer this question, we should perhaps consider whether
the user needs all these details, and all of them at the same hierarchical
level.

An ideal interface should probably automatically and silently select Unicode
(and its default UTF) whenever one or more of the characters in a document
are not representable in the local encoding.

Only if the user selects a menu like "Manual encoding settings", she should
be presented with a choice like "International (Unicode)", as opposed to
"Western (ISO 8859-1)", "Chinese, simplified (GB 2312-80)", and so on. All
entries should have a generic descriptive label together with a precise
geek-friendly label in parentheses.

When the user selects "International (Unicode)", he should be allowed to
enter an "Advanced settings" menu which, for this encoding, allows choosing
between "8 bit ASCII-compatible (UTF-8)", "16 bit with surrogates support
(UTF-16)", "flat 32 bit (UTF-32)". Selecting "16 bit ... (UTF-16)" shows
extra choices like "Big-endian" vs. "Little Endian".

Such an interface would *teach* the user the exact relationship between the
various choices. But, of course, this requires time and resources for a
complete redesign of the encoding menu. If the developers are just given the
time to throw in a flat list with all the options, we can't blame them for
the result...
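
As a rough illustration of the layered menu described above, here is a minimal Python sketch. The labels are the ones proposed in this message; everything else (the names ENCODING_MENU and resolve, the nested-dict representation, and the mapping of labels to Python codec names) is an assumption made for the example.

```python
# Each entry is either a codec name (a leaf) or a nested submenu.
ENCODING_MENU = {
    "International (Unicode)": {
        "8 bit ASCII-compatible (UTF-8)": "utf-8",
        "16 bit with surrogates support (UTF-16)": {
            "Big-endian": "utf-16-be",
            "Little Endian": "utf-16-le",
        },
        "flat 32 bit (UTF-32)": "utf-32",
    },
    "Western (ISO 8859-1)": "iso-8859-1",
    "Chinese, simplified (GB 2312-80)": "gb2312",
}


def resolve(menu, path):
    """Follow a list of menu labels and return the codec name at the end."""
    node = menu
    for label in path:
        node = node[label]
    if not isinstance(node, str):
        raise ValueError("selection is a submenu, not an encoding")
    return node


choice = resolve(
    ENCODING_MENU,
    ["International (Unicode)",
     "16 bit with surrogates support (UTF-16)",
     "Big-endian"],
)
print(choice)
```

A flat list with every option would be a single-level dictionary; the nesting is what lets the naive user stop at the first level.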

_ Marco




Re: UTF-16 is not Unicode

2002-02-12 Thread Martin Kochanski

As a poor software maker, I suppose I ought to defend other software makers. EVERYONE 
KNOWS that Unicode and UTF-16 are the same thing. It is, unfortunately, irrelevant 
that in this case (as in so many others) "what everyone knows" happens to be untrue. 
We exist to conform to the user's expectations, not to educate him; still less to 
confuse him by replacing a nice simple word (Unicode) with indigestible code letters 
and digits (UTF-16BE or whatever).

That said, has anyone a suggestion for names of available output formats (as presented 
to an end user) that would not confuse the user but would satisfy the purist?

>Date: Mon, 11 Feb 2002 07:10:23 -0700
>From: Tom Gewecke <[EMAIL PROTECTED]>
>Subject: Re: Unicode 3.2 comments 
>
>
>I'm not qualified to comment on the various issues raised by Mr. Hopwood,
>but I do hope that the definitions can be written to avoid confusion
>between Unicode as such and the various UTF's.  I constantly run into
>browser, mail, and text editing software with encoding menus that list, as
>two separate items, Unicode and UTF-8, as if Unicode and UTF-16 were
>identical and as if UTF-8 were not Unicode.