Re: UTF-16 is not Unicode
* Michael Everson
|
| I think it's clear that Unicode should give some advice as to how to
| announce encoding options in a useful way to the end user. For the
| two encodings we are discussing, may I suggest the following
| standard menu items:
|
|   Unicode (Raw, UTF-16)
|   Unicode (Web, UTF-8)

I don't think calling it "raw" is very good. It just keeps alive the myth that UTF-16 *is* Unicode. None of the UTFs are "raw", and the closest must surely be UTF-32. The below would probably be easier:

  Unicode (UTF-16)
  Unicode (UTF-8, default)

If you know what you're doing you can choose what you want. If not, you should just choose UTF-8.

--
Lars Marius Garshol, Ontopian http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC http://www.garshol.priv.no
RE: UTF-16 is not Unicode
David Starner wrote:
> On Tue, Feb 12, 2002 at 08:12:08PM +0100, Marco Cimarosti wrote:
> > OK, UTF-8 is my favorite default UTF too. However, whatever the
> > default is, it is easier to just call it "Unicode", and call the
> > other options "Unicode (something else)".
> >
> > That puts one less acronym in front of the "naive" user. The expert
> > user is supposed to know what the default UTF is on her platform.
>
> What happens when a user is told to save in UTF-16? What
> about when two users running different operating systems
> try to pass files about? And why would Unicode be any
> clearer to a naive user than UTF-16?

I only have a definite answer for the last question: everybody I know who works with computers knows what Unicode is, but outside this mailing list I have never met anyone who was familiar with the acronym "UTF-16"...

By the way, because of its transparent etymology, the name "Unicode" is also quite self-explanatory. On the other hand, "UTF-16" is just one more 3-letter acronym for one more 16-bit technology, so it could be mistaken for all kinds of computer-related products.

> IMO, UTF-16 is as clear as Unicode, and more accurate. Being
> consistent among platforms is a needed plus.

Perhaps "more accurate", but definitely not "as clear as".

_ Marco
Re: UTF-16 is not Unicode
At 14:28 -0600 2002-02-12, David Starner wrote:
>
>What happens when a user is told to save in UTF-16? What about when two
>users running different operating systems try to pass files about? And
>why would Unicode be any clearer to a naive user than UTF-16?
>
>IMO, UTF-16 is as clear as Unicode, and more accurate. Being consistent
>among platforms is a needed plus.

Internet Explorer calls Unicode (UTF-8) "Universal Alphabet". Now I would say pretty much the same thing to the layman, but the distinction between what's on the web (UTF-8) and what might be coded elsewhere (UTF-16) should be made.

Apple's TextEdit with OS X gives a set of choices for encodings to the user in the Open File dialogue:

  Western (Mac Roman)
  Western (Windows Latin 1)
  Japanese (Mac OS)
  Japanese (Shift JIS)
  Traditional Chinese (Mac OS)
  Simplified Chinese (Mac OS)
  Korean (Mac OS)
  Unicode
  UTF-8

Apple's OS 9 WorldText can save as UTF-16 but calls it "Standard Unicode".

Cyclone's Unicode options are four:

  Standard (16 bit)
  Standard (16 bit) Canonical Decomposition
  UTF-7
  UTF-8

These distinctions are interesting. They show that many users are expected to know, or find out, about specific encoding differences.

I think it's clear that Unicode should give some advice as to how to announce encoding options in a useful way to the end user. For the two encodings we are discussing, may I suggest the following standard menu items:

  Unicode (Raw, UTF-16)
  Unicode (Web, UTF-8)

--
Michael Everson *** Everson Typography *** http://www.evertype.com
Re: UTF-16 is not Unicode
* Marco Cimarosti
|
| Only if the user selects a menu like "Manual encoding settings", she
| should be presented with a choice like "International (Unicode)",
| as opposed to "Western (ISO 8859-1)", "Chinese, simplified (GB
| 2312-80)", and so on. All entries should have a generic descriptive
| label together with a precise geek-friendly label in parentheses.

This is what Mozilla does, and I seem to recall that IE does the same. (I can't check here, being Linux-only at home.)

Opera has an encoding menu divided into Unicode / Western European / Central European / ... / Chinese / Japanese / Korean, where the choices on the next level are UTF-8, UTF-16 / ISO 8859-1, ISO 8859-15, Windows-1252 / ... Seems to work pretty well.

--
Lars Marius Garshol, Ontopian http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC http://www.garshol.priv.no
RE: UTF-16 is not Unicode
> An ideal interface should probably automatically and silently select
> Unicode (and its default UTF) whenever one or more of the characters
> in a document are not representable in the local encoding.

I beg to differ. Silently making such an unexpected change is guaranteed to confuse the user, especially once she starts exchanging the files or loading them in other programs. The interface should warn the user and offer a couple of sensible choices, one of them (and maybe the default) being to save using one of the UTFs.

YA
Re: UTF-16 is not Unicode
On Tue, Feb 12, 2002 at 08:12:08PM +0100, Marco Cimarosti wrote:
> OK, UTF-8 is my favorite default UTF too. However, whatever the default is,
> it is easier to just call it "Unicode", and call the other options "Unicode
> (something else)".
>
> That puts one less acronym in front of the "naive" user. The expert user is
> supposed to know what the default UTF is on her platform.

What happens when a user is told to save in UTF-16? What about when two users running different operating systems try to pass files about? And why would Unicode be any clearer to a naive user than UTF-16?

IMO, UTF-16 is as clear as Unicode, and more accurate. Being consistent among platforms is a needed plus.

--
David Starner / Давид Старнзр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing with the youth. -- Information Society, "Peace and Love, Inc."
RE: UTF-16 is not Unicode
David Starner wrote:
> On Tue, Feb 12, 2002 at 11:22:01AM +0100, Marco Cimarosti wrote:
> > At best, the localization could use a label such as "Unicode (UTF-8)"
> > to enforce the concept that UTF-8 is Unicode as well. But it could
> > hardly use "Unicode (UTF-16BE)" for the *default* UTF, because the
> > user would ask "Where is *plain* 'Unicode'?"
>
> Then possibly the user should be educated. This gets hairier if you
> think about crossplatform; UTF-16 is not suitable for native use on
> Unix systems (the whole reason UTF-8 was created), so the user should
> be encouraged to use UTF-8 for the default Unicode encoding on those
> platforms.

OK, UTF-8 is my favorite default UTF too. However, whatever the default is, it is easier to just call it "Unicode", and call the other options "Unicode (something else)".

That puts one less acronym in front of the "naive" user. The expert user is supposed to know what the default UTF is on her platform.

> > An ideal interface should probably automatically and silently select
> > Unicode (and its default UTF) whenever one or more of the characters
> > in a document are not representable in the local encoding.
>
> NO! Don't ever save in an encoding that isn't the local encoding
> without user interaction.

OK, I withdraw the word "silently"! However, even a simple prompt like "One or more characters cannot be saved in the current locale. Do you want to save in Unicode?" *is* interaction, and it tells the user everything she needs to know to decide.

_ Marco
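The prompt-before-saving behaviour discussed in this exchange could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name `choose_save_encoding`, the `confirm` callback, and the choice of UTF-8 as the "Unicode" fallback are not from the thread.

```python
def choose_save_encoding(text, local_encoding, confirm):
    """Return the encoding to save `text` in, or None if the user declines.

    `confirm` is a hypothetical callback: it receives the prompt string
    and returns True (yes) or False (no). Nothing changes silently.
    """
    try:
        # If every character fits the local encoding, keep using it.
        text.encode(local_encoding)
        return local_encoding
    except UnicodeEncodeError:
        pass

    prompt = ("One or more characters cannot be saved in the current "
              "locale. Do you want to save in Unicode?")
    # Assumption: "Unicode" here means UTF-8, the default favoured above.
    return "utf-8" if confirm(prompt) else None
```

A caller would pass its own dialog function as `confirm`; when the answer is no, the application could offer the other options mentioned in the thread, such as transliterating.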
Re: UTF-16 is not Unicode
On Tue, Feb 12, 2002 at 11:22:01AM +0100, Marco Cimarosti wrote:
> At best, the localization could use a label such as "Unicode (UTF-8)" to
> enforce the concept that UTF-8 is Unicode as well. But it could hardly use
> "Unicode (UTF-16BE)" for the *default* UTF, because the user would ask
> "Where is *plain* 'Unicode'?"

Then possibly the user should be educated. This gets hairier if you think about crossplatform; UTF-16 is not suitable for native use on Unix systems (the whole reason UTF-8 was created), so the user should be encouraged to use UTF-8 for the default Unicode encoding on those platforms.

> An ideal interface should probably automatically and silently select Unicode
> (and its default UTF) whenever one or more of the characters in a document
> are not representable in the local encoding.

NO! Don't ever save in an encoding that isn't the local encoding without user interaction. You can very quickly get mojibake with that - the next program will probably open the file in the local encoding. It also tends to annoy the user when the program picks the encoding, ignoring what everything else on the system uses. Offer to change the encoding or transliterate.

> When the user selects "International (Unicode)", he should be allowed to
> enter an "Advanced settings" menu which, for this encoding, allows choosing
> between "8 bit ASCII-compatible (UTF-8)", "16 bit with surrogates support
> (UTF-16)", "flat 32 bit (UTF-32)". Selecting "16 bit ... (UTF-16)" shows
> extra choices like "Big-endian" vs. "Little Endian".

That's too painful, unless you're doing something like Netscape that has to provide a million encoding options. UTF-32 is worthless as a disk encoding. Handle both endians of UTF-16 correctly, and assume everyone else does - if they don't, then a converter program can be found and used. That reduces your choices to two: UTF-8 and UTF-16. If you must complicate things, SCSU is a more useful complication.

--
David Starner / Давид Старнзр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing with the youth. -- Information Society, "Peace and Love, Inc."
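One way to read "handle both endians of UTF-16 correctly" is to let the byte order mark decide, so the user only ever chooses between UTF-8 and UTF-16. A minimal Python sketch, assuming the file carries a BOM when it is UTF-16 and that BOM-less input is UTF-8 (both are assumptions, as is the function name `decode_unicode_file`):

```python
import codecs


def decode_unicode_file(data: bytes) -> str:
    """Decode bytes that are UTF-8 or UTF-16 in either byte order."""
    if data.startswith(codecs.BOM_UTF8):
        # Strip the UTF-8 signature before decoding.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads the BOM, picks the matching
        # byte order, and drops the BOM from the result.
        # UTF-32 is deliberately not handled here.
        return data.decode("utf-16")
    # Assumption: no BOM means UTF-8.
    return data.decode("utf-8")
```

With this approach the big-endian and little-endian variants never need to appear in a menu; only the reader has to care about them.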
Re: UTF-16 is not Unicode
>As a poor software maker, I suppose I ought to defend other software
>makers. EVERYONE KNOWS that Unicode and UTF-16 are the same thing. It is,
>unfortunately, irrelevant that in this case (as in so many others) "what
>everyone knows" happens to be untrue. We exist to conform to the user's
>expectations, not to educate him; still less to confuse him by replacing
>a nice simple word (Unicode) with indigestible code letters and digits
>(UTF-16BE or whatever).
>
>That said, has anyone a suggestion for names of available output formats
>(as presented to an end user) that would not confuse the user but would
>satisfy the purist?

Just about anything would seem to be an improvement over the PR which Unicode now gets every time a user selects "Unicode" (when it means UTF-16) on his browser encoding menu and gets total gibberish on his screen.
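The "total gibberish" is easy to reproduce. The Python sketch below (the function name `misread_as_utf16` is made up for illustration, and little-endian order is an arbitrary choice) takes text that was saved as UTF-8 and decodes it the way a "Unicode" menu item that really means UTF-16 would:

```python
def misread_as_utf16(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as UTF-16LE."""
    raw = text.encode("utf-8")
    # errors="replace" stands in for a browser's tolerant decoding:
    # an odd trailing byte becomes U+FFFD instead of raising an error.
    return raw.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    # Each pair of ASCII bytes is fused into one unrelated 16-bit character.
    print(misread_as_utf16("Unicode is not UTF-16"))
```

Pairs of one-byte ASCII characters are read as single 16-bit code units, which is why plain English text typically shows up as a run of CJK ideographs.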
RE: UTF-16 is not Unicode
Martin Kochanski wrote:
> >From: Tom Gewecke <[EMAIL PROTECTED]>
[...]
> > I constantly run into browser, mail, and text editing software
> > with encoding menus that list, as two separate items, Unicode
> > and UTF-8, as if Unicode and UTF-16 were identical and as if
> > UTF-8 were not Unicode.
>
> As a poor software maker, I suppose I ought to defend other
> software makers. EVERYONE KNOWS that Unicode and UTF-16 are
> the same thing. It is, unfortunately, irrelevant that in this
> case (as in so many others) "what everyone knows" happens to
> be untrue. We exist to conform to the user's expectations,
> not to educate him; still less to confuse him by replacing a
> nice simple word (Unicode) with indigestible code letters and
> digits (UTF-16BE or whatever).

The terms that actually get into common usage follow unpredictable routes. I guess that if users commonly say "Save the file as Unicode" as opposed to "Save it in UTF-8", the labels on the menu should reflect this language, or the user would not know what to do.

At best, the localization could use a label such as "Unicode (UTF-8)" to enforce the concept that UTF-8 is Unicode as well. But it could hardly use "Unicode (UTF-16BE)" for the *default* UTF, because the user would ask "Where is *plain* 'Unicode'?"

> That said, has anyone a suggestion for names of available
> output formats (as presented to an end user) that would not
> confuse the user but would satisfy the purist?

Before trying to answer this question, we should perhaps consider whether the user needs all these details, and all of them at the same hierarchical level.

An ideal interface should probably automatically and silently select Unicode (and its default UTF) whenever one or more of the characters in a document are not representable in the local encoding.

Only if the user selects a menu like "Manual encoding settings", she should be presented with a choice like "International (Unicode)", as opposed to "Western (ISO 8859-1)", "Chinese, simplified (GB 2312-80)", and so on. All entries should have a generic descriptive label together with a precise geek-friendly label in parentheses.

When the user selects "International (Unicode)", he should be allowed to enter an "Advanced settings" menu which, for this encoding, allows choosing between "8 bit ASCII-compatible (UTF-8)", "16 bit with surrogates support (UTF-16)", "flat 32 bit (UTF-32)". Selecting "16 bit ... (UTF-16)" shows extra choices like "Big-endian" vs. "Little Endian".

Such an interface would *teach* the user the exact relationship between the various choices. But, of course, this requires time and resources for a complete redesign of the encoding menu. If the developers are just given the time to throw in a flat list with all the options, we can't blame them for the result...

_ Marco
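The layered menu described in this message (a generic label plus a precise label in parentheses, with advanced choices nested underneath) might be modelled as a small tree. The Python sketch below is only one possible shape: the class name `EncodingChoice`, the `render` helper, and the "UTF-16BE"/"UTF-16LE" labels attached to the byte-order entries are assumptions, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EncodingChoice:
    label: str      # generic descriptive label
    precise: str    # precise "geek-friendly" label, shown in parentheses
    advanced: List["EncodingChoice"] = field(default_factory=list)

    def menu_text(self) -> str:
        return f"{self.label} ({self.precise})"


MENU = [
    EncodingChoice("International", "Unicode", advanced=[
        EncodingChoice("8 bit ASCII-compatible", "UTF-8"),
        EncodingChoice("16 bit with surrogates support", "UTF-16", advanced=[
            EncodingChoice("Big-endian", "UTF-16BE"),      # label assumed
            EncodingChoice("Little-endian", "UTF-16LE"),   # label assumed
        ]),
        EncodingChoice("flat 32 bit", "UTF-32"),
    ]),
    EncodingChoice("Western", "ISO 8859-1"),
    EncodingChoice("Chinese, simplified", "GB 2312-80"),
]


def render(choices, depth=0):
    """Return the menu as indented lines, one per entry, depth-first."""
    lines = []
    for choice in choices:
        lines.append("  " * depth + choice.menu_text())
        lines.extend(render(choice.advanced, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(MENU)))
```

A real interface would show only the top level by default and reveal the nested entries on request, which is what keeps the acronyms away from the "naive" user.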
Re: UTF-16 is not Unicode
As a poor software maker, I suppose I ought to defend other software makers. EVERYONE KNOWS that Unicode and UTF-16 are the same thing. It is, unfortunately, irrelevant that in this case (as in so many others) "what everyone knows" happens to be untrue. We exist to conform to the user's expectations, not to educate him; still less to confuse him by replacing a nice simple word (Unicode) with indigestible code letters and digits (UTF-16BE or whatever).

That said, has anyone a suggestion for names of available output formats (as presented to an end user) that would not confuse the user but would satisfy the purist?

>Date: Mon, 11 Feb 2002 07:10:23 -0700
>From: Tom Gewecke <[EMAIL PROTECTED]>
>Subject: Re: Unicode 3.2 comments
>
>
>I'm not qualified to comment on the various issues raised by Mr. Hopwood,
>but I do hope that the definitions can be written to avoid confusion
>between Unicode as such and the various UTF's. I constantly run into
>browser, mail, and text editing software with encoding menus that list, as
>two separate items, Unicode and UTF-8, as if Unicode and UTF-16 were
>identical and as if UTF-8 were not Unicode.