RE: This spoofing and security thread
The very fact that most of them can be reduced to ASCII and people still find the resulting text useful and accurate to the original is a sign that the important characters in English are in ASCII. And all the standard transliterations - em-dashes -> --, c-cedilla -> c, e-acute, e-grave -> e, o-umlaut -> o, shaped quotes -> " and ' - are from characters in Windows-1252. Well, wouldn't you expect an American standard to properly encode the important characters for English? I would. Only ISO has the luxury of encoding Western European languages without catering properly to French and some Nordic language (sorry, forgot which; as for French, I am referring to the lack of the oe ligature in iso-8859-1). YA
Unicode and end users
First, let me thank everyone for their wise and experienced comments. This is exactly what this sort of list should be for... For the sake of clarity, let me define two terms: 1. Unicode means Unicode. 2. UNICODE means what an end user thinks when he sees the characters U, n, i, c, o, d, e on the screen, in that order. What we are trying to establish is the exact meaning that UNICODE ought to have - that is, if it can have one at all. I suggest that a more technical definition of UNICODE could be a file format that can be read by programs that read UNICODE. This is pretty certain to be what a user understands by the word! Now in the world of application programs intended for real human beings (as opposed, for example, to specialised technical tools), I cannot see that any program will survive for long if it cannot read, without user intervention, files written in all the self-describing Unicode formats (all those with a BOM). It follows that any of these formats could, with equal propriety, be described as UNICODE. Moving back to output formats: this implies that the only requirement for a program that outputs data should be that if the user asks it to use UNICODE, the program uses one of the self-describing formats. The decision as to *which* of these formats to use would be up to the programmer. Depending on the circumstances, he may hard-wire a specific choice (perhaps whatever is best for the platform), or he may provide a configuration option accessible to more technical users. Now, a question: Are there, in fact, many circumstances in which it is necessary for an end user to create files that do *not* have a BOM at the beginning?
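A minimal sketch of the "read every self-describing format without user intervention" idea from the message above, in Python. The function name `read_self_describing` and the table `_BOMS` are hypothetical; only the BOM constants come from Python's standard `codecs` module. It is an illustration under those assumptions, not a description of any particular program.

```python
import codecs

# Longer signatures come first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_self_describing(data: bytes):
    """Return the decoded text if data starts with a BOM, otherwise None."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Drop the signature and decode the rest in the indicated form.
            return data[len(bom):].decode(encoding)
    return None  # no BOM: the caller needs a label or a heuristic
```

One known weakness of this ordering: UTF-16LE text that begins with a BOM followed by U+0000 would be taken for UTF-32LE.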
Re: This spoofing and security thread
At 23:43 13/02/02 -0600, David Starner wrote: On Wed, Feb 13, 2002 at 08:46:31PM -0800, Yves Arrouye wrote: What do you mean? I've done works for Project Gutenberg, and looked at a number of books with thoughts of reducing them to ASCII. In my opinion, Windows-1252 has every character that most English books will need, Especially those books that you want to reduce to ASCII :-) The very fact that most of them can be reduced to ASCII and people still find the resulting text useful and accurate to the original is a sign that the important characters in English are in ASCII. And the fact that after reading those books a whole generation of English-speakers will go round Spain (or even the Californian school system) asking people ?cuantos anos tiene? and NOT get the answer they deserve shows, depending on your viewpoint, the patient forbearance of a noble race or the proper humility of a conquered people...
RE: This spoofing and security thread
Yves Arrouye wrote: Well, wouldn't you expect an American standard to properly encode the important characters for English? I would. Only ISO has the luxury of encoding Western Europe languages without catering properly to French and some Nordic language (sorry, forgot which; as for French, I am referring to the lack of oe ligature in iso-8859-1). Perhaps you are referring to the lack of the letter š for Finnish. BTW, it also lacks Ÿ for French. Thanks to the euro, all this was fixed in ISO 8859-15: A4 € EURO SIGN; A6 Š LATIN CAPITAL LETTER S WITH CARON; A8 š LATIN SMALL LETTER S WITH CARON; B4 Ž LATIN CAPITAL LETTER Z WITH CARON; B8 ž LATIN SMALL LETTER Z WITH CARON; BC Œ LATIN CAPITAL LIGATURE OE; BD œ LATIN SMALL LIGATURE OE; BE Ÿ LATIN CAPITAL LETTER Y WITH DIAERESIS. _ Marco
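The list of changed positions above can be cross-checked mechanically. The sketch below assumes Python's built-in `iso8859-1` and `iso8859-15` codecs are faithful to the two standards; the variable name `differences` is just for illustration.

```python
# Collect every byte value that decodes differently in ISO 8859-1 and
# ISO 8859-15, keyed by the byte value.
differences = {}
for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("iso8859-1")
    latin9 = raw.decode("iso8859-15")
    if latin1 != latin9:
        differences[value] = (latin1, latin9)

# The keys of `differences` are the positions that ISO 8859-15 reassigned;
# each value pairs the old character with its replacement.
changed_positions = sorted(differences)
```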
RE: Unicode and end users
Martin Kochanski wrote: Are there, in fact, many circumstances in which it is necessary for an end user to create files that do *not* have a BOM at the beginning? AFAIK, UTF-8 files are NOT supposed to have a BOM in them. Why is UTF-16 perceived as UNICODE? Well, we all know it's because UCS-2 used to be the ONLY implementation of Unicode. But there is another important difference between UTF-16 and UTF-8. It is barely possible to misinterpret UTF-16, because it uses shorts and not bytes. On the other hand, UTF-8 and ASCII are in extreme cases identical. Why not have BOM in UTF-8? Probably because of the applications that don't really need to know that a file is in UTF-8, especially since it may be pure ASCII in many cases (e.g. system configuration files). And if Unicode is THE codeset to be used in the future, then at some point in time all files would begin with a UTF-8 BOM. Quite unnecessary. Further problems arise when you concat files or start reading in the middle. To be honest, Unicode meaning UTF-16 and UTF-8 are fine with me. It's what I am used to. For UNIX users UTF-8 is just like EUC or ISO-8859-x, another codeset. The fact that it is universal does not mean it has to be called Unicode, I think UTF-8 is just fine and equally (or more) useful. And on UNIX, it is essential that the user is aware of the codeset that is being used. I keep seeing files being used as examples. Think filesystems, file names. File names would surely not start with a BOM, even if files would. Suppose you have a script that will create some files, it is published on the web, and you want to save it so you can run it. Now, it is up to you, how to save it. If you use UTF-8 filenames, you do not want to save it as some ISO, neither as just any Unicode, but precisely UTF-8. The shell will execute the script and use byte sequences from the file to create filenames. Now, an opposite example.
You execute ls > ls.out, in a directory that has some filenames (say, old files) in ISO and many others in UTF-8. What format is the resulting file in? Well, since this is happening in the year 2016, the editor will assume it's in UTF-8. We already agreed there are no BOM's in files unless they are UTF-16, so the file must be UTF-8 just like (almost) everything else is. Even if BOM's were used, should this file have one? Anyway, some invalid sequences will be encountered by the editor, but then hopefully it will simply display some replacement characters (or ask if it can do so). Hopefully it will allow me to save the file, with invalid sequences intact. Editing invalid sequences (or inserting new ones) would be too much to ask, right? What bothers me a little bit is that I would not be able to save such a file as UTF-16 because of the invalid sequences in it. Why would I? Well, Windows has more and more support for UTF-8, so maybe I don't really need to. I still wish I had an option though. This again makes me think that UTF-8 and UTF-16 are not both Unicode. Maybe UTF-16 is 'more' Unicode right now, because of the past. But maybe UTF-8 will be 'more' Unicode in the future, because it can contain invalid sequences and these can be properly interpreted by someone at a later time. Unless UTF-16 has that same ability, it will lose the battle of being an 'equally good Unicode format'. And why do I keep this in the Unicode and end users thread? Because invalid sequences (and old filenames) are a fact that users WILL experience and pretending that this is just a case of non-conformance is not in the best interest of the users. Lars Kristan Storage Data Management Lab HERMES SoftLab
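The mixed-encoding listing described above can be imitated in a few lines of Python. This is only a sketch: the file names are invented, and the `surrogateescape` error handler is a Python facility that postdates this thread, not something the poster had available. It shows an editor-style display with replacement characters, a lossless round trip of the invalid bytes, and the failure to re-save the result as UTF-16.

```python
# Hypothetical directory listing: one legacy ISO 8859-1 name, one UTF-8 name.
listing = "caf\u00e9.txt\n".encode("iso8859-1") + "na\u00efve.txt\n".encode("utf-8")

# What a tolerant editor might display: U+FFFD stands in for the invalid byte.
shown = listing.decode("utf-8", errors="replace")

# Lossless alternative: each undecodable byte becomes a lone surrogate
# (U+DC80..U+DCFF) that encodes back to the original byte.
kept = listing.decode("utf-8", errors="surrogateescape")
round_trip = kept.encode("utf-8", errors="surrogateescape")

# The smuggled bytes have no UTF-16 form, so a strict encoder refuses them.
try:
    kept.encode("utf-16")
    utf16_ok = True
except UnicodeEncodeError:
    utf16_ok = False
```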
Unicode, Oh Unicode: lyrics
I can't make out the lyrics through my crappy speakers. Are they on line anywhere? -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan han mathon ne chae, a han noston ne 'wilith. --Galadriel, _LOTR:FOTR_
Re: Off-Topic (Re: This spoofing and security thread)
Patrick Andries scripsit: Quite a feat indeed : since e accounts for 13% of letters in a typical English text. Indeed. It's called Gadsby, and the author of La disparition certainly knew it. Interesting. It appears to be online at http://gadsby.hypermart.net/. Lots of nasty pop-up ads there though. -- +---++---+ | Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer | +---++---+ | The XML Bible, 2nd Edition (Hungry Minds, 2001) | | http://www.ibiblio.org/xml/books/bible2/ | | http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/ | +--+-+ | Read Cafe au Lait for Java News: http://www.cafeaulait.org/ | | Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/ | +--+-+
Re: Unicode, Oh Unicode: lyrics
On Thu, 14 Feb 2002, John Cowan wrote: I can't make out the lyrics through my crappy speakers. Are they on line anywhere? That's it: Oh beautiful for Uni-Han, for spacious User Zone! For rampant scripts of India and polar Nunavut! Unicode, Oh Unicode May all your code points shine forever and your beacon light the world! Oh, marvelous for sixteen bits, for precious surrogates! For Bi-Di algorithm dear and stalwart I-P-A! Unicode, Oh Unicode May all your code points shine forever and your beacon light the world! Oh, glorious for Hangul fair, for symbols mathematical! For myriad exotic scripts and punctuation we adore! Unicode, Oh Unicode May all your code points shine forever and your beacon light the world! BTW, I was just wondering if a new version will be prepared for 4.0... roozbeh
Re: Off-Topic (Re: This spoofing and security thread)
At 11:59 PM -0500 2/13/02, John Cowan wrote: There is an English translation (or translation): The Void, wherein the hero, Anton Voyl, becomes Anton Vowl. There are German and Danish translations too. Do you happen to know if these translations also avoid the letter e? German's especially impressive since I think e makes up about 20% of the letters in typical German. -- Elliotte Rusty Harold
Re: Off-Topic (Re: This spoofing and security thread)
Elliotte Rusty Harold wrote: At 11:59 PM -0500 2/13/02, John Cowan wrote: There is an English translation (or translation): The Void, wherein the hero, Anton Voyl, becomes Anton Vowl. There are German and Danish translations too. Do you happen to know if these translations also avoid the letter e? German's especially impressive since I think e makes up about 20% of the letters in typical German. 16,7 % http://www.santacruzpl.org/readyref/files/g-l/ltfrqger.shtml 17,5% for French according to http://www.santacruzpl.org/readyref/files/g-l/ltfrqfr.shtml 13,1% for English http://www.santacruzpl.org/readyref/files/g-l/ltfrqeng.shtml 13,7% for Spanish http://www.santacruzpl.org/readyref/files/g-l/ltfrqsp.shtml P. Andries
RE: Off-Topic (Re: This spoofing and security thread)
If my memory is correct, James Thurber also wrote a short (American English) book called The Wonderful O in which he did not use the letter e. Clive -Original Message- From: John Cowan [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 13, 2002 10:59 PM To: Patrick Andries Cc: Asmus Freytag; Juliusz Chroboczek; [EMAIL PROTECTED] Subject: Re: Off-Topic (Re: This spoofing and security thread) Patrick Andries scripsit: Quite a feat indeed : since e accounts for 13% of letters in a typical English text. Indeed. It's called Gadsby, and the author of La disparition certainly knew it. There is also one in French where e accounts for 15,3% of letters in a typical text It's called La disparition (320 pages without an e), by Georges Perec. Extract http://www2.ec-lille.fr/~book/perec/textes/disparition.shtml There is an English translation (or translation): The Void, wherein the hero, Anton Voyl, becomes Anton Vowl. There are German and Danish translations too. -- John Cowan http://www.ccil.org/~cowan [EMAIL PROTECTED] To say that Bilbo's breath was taken away is no description at all. There are no words left to express his staggerment, since Men changed the language that they learned of elves in the days when all the world was wonderful. --_The Hobbit_
Re: Unicode and end users
MK What we are trying to establish is the exact meaning that UNICODE MK ought to have - that is, if it can have one at all. In the Unix-like world, the term ``UTF-8'' has been used quite consistently, and most documentation avoids using Unicode for a disk format (using it for the consortium, er., the Consortium, the character repertoire and, when useful, for the coded character set). The Unix-like public is used to thinking of UTF-8 as the format in which Unicode text is saved on disk, and ``UTF-8 (Unicode)'' or perhaps ``Unicode (UTF-8)'' should be the preferred user-interface item. MK Are there, in fact, many circumstances in which it is necessary MK for an end user to create files that do *not* have a BOM at the MK beginning? You should never use either BOMs or UTF-16 on Unix-like systems; using either will break too much of the system. Juliusz
Re: Unicode and end users
On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote: MK What we are trying to establish is the exact meaning that UNICODE MK ought to have - that is, if it can have one at all. In the Unix-like world, the term ``UTF-8'' has been used quite consistently, and most documentation avoids using Unicode for a disk format (using it for the consortium, er., the Consortium, the character repertoire and, when useful, for the coded character set). The Unix-like public is used to thinking of UTF-8 as the format in which Unicode text is saved on disk, and ``UTF-8 (Unicode)'' or perhaps ``Unicode (UTF-8)'' should be the preferred user-interface item. I would rather recommend that you write ISO 10646 UTF-8 as the ISO standard is a standard in many countries while Unicode is not. Kind regards keld
Re: This spoofing and security thread
At 16:51 + 2002-02-14, Juliusz Chroboczek wrote: - a cross-reference of characters whose associated glyphs could be confused by a non-technical user; ME Out of the entire standard? Who's going to do that for free? :-) I don't know. I'm not lobbying anyone here -- I'm just trying to clarify why so many of us are finding it difficult to get to grips with Unicode. (Were you volunteering? ;-) (Michael laughs out loud) Not for free. ;-) Actually the annotations to the Unicode names list includes many such cross references. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Unicode and end users
Lars Kristan [EMAIL PROTECTED] wrote: AFAIK, UTF-8 files are NOT supposed to have a BOM in them. Different operating systems and applications have different preferences. There is no universal right or wrong about this. This is unfortunate, but true. Why is UTF-16 perceived as UNICODE? Well, we all know it's because UCS-2 used to be the ONLY implementation of Unicode. But there is another important difference between UTF-16 and UTF-8. It is barely possible to misinterpret UTF-16, because it uses shorts and not bytes. On the other hand, UTF-8 and ASCII are in extreme cases identical. At the risk of being mistaken for juuitchan by citing a Japanese example: A non-BOM file that starts with the bytes 0x30 0x42 could be the UTF-8 characters 0B, or it could be the UTF-16BE character HIRAGANA LETTER A. (A similar situation applies for UTF-16LE.) Now, 0B might not be the first two characters of many novels, but in a techie Unix environment it could easily be the start of a text-format data file. Two common heuristics for determining whether a file is UTF-16 are to check whether every other byte is 0x00, or whether every other byte is the same. The former fails for non-Latin scripts, the latter fails (less frequently) for scripts that are not part of a smallish alphabet. That's the problem with no BOM: you have to resort to heuristics, or external tagging. Why not have BOM in UTF-8? Probably because of the applications that don't really need to know that a file is in UTF-8, especially since it may be pure ASCII in many cases (e.g. system configuration files). And if Unicode is THE codeset to be used in the future, then at some point in time all files would begin with a UTF-8 BOM. Quite unnecessary. Further problems arise when you concat files or start reading in the middle. That's why U+2060 WORD JOINER is being introduced in Unicode 3.2. Hopefully it will take over the ZWNBSP semantics from U+FEFF, which can then be used *solely* as a BOM.
Eventually, if this happens, it will become safe to strip BOM's as they appear. (Of course, if you are splitting or concatenating files, you should not do any interpretation anyway.) I have never seen a non-pathological example where stripping a file- or stream-initial U+FEFF was harmful because of the possibility that it was intended as ZWNBSP. ZWNBSP (or WORD JOINER) affects the behavior of the characters before and after it. If there is no character before ZWNBSP, it doesn't belong there. [O]n UNIX, it is essential that the user is aware of the codeset that is being used. Unix users are accustomed to dealing with such details. Anyway, some invalid sequences will be encountered by the editor, but then hopefully it will simply display some replacement characters (or ask if it can do so). Hopefully it will allow me to save the file, with invalid sequences intact. Editing invalid sequences (or inserting new ones) would be too much to ask, right? What bothers me a little bit is that I would not be able to save such a file as UTF-16 because of the invalid sequences in it. Why would I? Well, Windows has more and more support for UTF-8, so maybe I don't really need to. I still wish I had an option though. This again makes me think that UTF-8 and UTF-16 are not both Unicode. Maybe UTF-16 is 'more' Unicode right now, because of the past. But maybe UTF-8 will be 'more' Unicode in the future, because it can contain invalid sequences and these can be properly interpreted by someone at a later time. Unless UTF-16 has that same ability, it will lose the battle of being an 'equally good Unicode format'. I don't think the fact that invalid sequences are possible in UTF-8 and not in UTF-16 makes UTF-8 inferior, or any less Unicode. It was designed that way. Invalid sequences always represent a problem, just like line noise. They should not be treated as a normal situation. -Doug Ewell Fullerton, California
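The 0x30 0x42 ambiguity and the first of the two heuristics mentioned in this message can be sketched in Python as follows. The helper name `looks_like_utf16_by_zero_bytes` is hypothetical, and the function is a deliberately naive rendering of the "every other byte is 0x00" check, meant to show where it breaks rather than to be used as a detector.

```python
raw = b"\x30\x42"                      # the two bytes from the example
as_utf8 = raw.decode("utf-8")          # DIGIT ZERO, LATIN CAPITAL LETTER B
as_utf16be = raw.decode("utf-16-be")   # HIRAGANA LETTER A (U+3042)


def looks_like_utf16_by_zero_bytes(data: bytes) -> bool:
    """Naive heuristic: every even-offset or every odd-offset byte is 0x00."""
    if len(data) < 2 or len(data) % 2:
        return False
    return not any(data[0::2]) or not any(data[1::2])


# Latin-script text passes the check; kana text, which has no zero bytes
# in UTF-16, does not, even though both inputs are valid UTF-16BE.
latin_guess = looks_like_utf16_by_zero_bytes("Hi".encode("utf-16-be"))
kana_guess = looks_like_utf16_by_zero_bytes("\u3042\u3044".encode("utf-16-be"))
```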
Smiles, faces, etc
This mailing list seems to be the first place for this, so... There are two face characters in the Miscellaneous group. I was wondering if it would be appropriate to expand upon those two, possibly in their own block, and add a series of smiles/faces/emoticons to the Unicode standard. Like 'em or hate 'em, those :) are here to stay. ...and there are at least twelve easily identifiable faces in common use on the internet. Anyone have thoughts on this? --Harry Davis
FW: This spoofing and security thread
-Original Message- From: Hietaniemi Jarkko (NRC/Boston) Sent: Thursday, February 14, 2002 12:43 To: 'ext Marco Cimarosti' Subject: RE: This spoofing and security thread Perhaps you are referring to the lack of letter š for Finnish. BTW, it also lacks Ÿ for French. Thanks to euro, all this was fixed in ISO 8859-15: A4 € EURO SIGN; A6 Š LATIN CAPITAL LETTER S WITH CARON; A8 š LATIN SMALL LETTER S WITH CARON; B4 Ž LATIN CAPITAL LETTER Z WITH CARON; B8 ž LATIN SMALL LETTER Z WITH CARON; BC Œ LATIN CAPITAL LIGATURE OE; BD œ LATIN SMALL LIGATURE OE; BE Ÿ LATIN CAPITAL LETTER Y WITH DIAERESIS. Yup. Strictly speaking, though, the caroned s and z are not needed for native Finnish words, but they are needed for the proper spelling of a few Finnishized loanwords like šakki (chess) and šekki (cheque), and for the proper spelling of Finnish transliterations of Cyrillic names. (The traditional workaround for not having the letters has been to use sh and zh.) I think the caron versions also make the Sámi people happier.
RE: This spoofing and security thread
- a map from characters to languages. This has been attempted for some sets of Latin-based languages. I don't have a link to one of the documents that do that. Main problem is that many *more* characters are actually used (and used quite commonly) by users of these languages, than acknowledged by the makers of these lists. http://www.eki.ee/letter/ looks reasonably extensive.
Re: Unicode and end users
On Thu, Feb 14, 2002 at 05:46:46PM +0100, Keld Jørn Simonsen wrote: I would rather recommend that you write ISO 10646 UTF-8 as the ISO standard is a standard in many countries while Unicode is not. *Grumble*. The whole point of this discussion is making it clear for the users. Unicode is more clear for more users than ISO 10646 is. There is no reason to use ISO 10646, besides pedanticness. -- David Starner / Давид Старнэр - [EMAIL PROTECTED] Pointless website: http://dvdeug.dhis.org What we've got is a blue-light special on truth. It's the hottest thing with the youth. -- Information Society, Peace and Love, Inc.
Re: Unicode and end users
At 14:16 -0600 2002-02-14, David Starner wrote: The whole point of this discussion is making it clear for the users. Unicode is more clear for more users than ISO 10646 is. There is no reason to use ISO 10646, besides pedanticness. It is ISO/IEC 10646. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Unicode and end users
From: Michael Everson [EMAIL PROTECTED] At 14:16 -0600 2002-02-14, David Starner wrote: There is no reason to use ISO 10646, besides pedanticness. It is ISO/IEC 10646. The defense rests. MichKa Michael Kaplan Trigeminal Software, Inc. -- http://www.trigeminal.com/
Re: Off-Topic (Re: This spoofing and security thread)
This was discussed in a book I recently read, called Code (don't recall the author right now). Apparently the Danish (I think) translation has an error, but only one. I guess the proof reader was not familiar with grep :) Barry At 08:23 AM 2/14/2003 -0500, Elliotte Rusty Harold wrote: At 11:59 PM -0500 2/13/02, John Cowan wrote: There is an English translation (or translation): The Void, wherein the hero, Anton Voyl, becomes Anton Vowl. There are German and Danish translations too. Do you happen to know if these translations also avoid the letter e? German's especially impressive since I think e makes up about 20% of the letters in typical German.
Re: Off-Topic (Re: This spoofing and security thread)
At 17:23 -0500 2002-02-14, John Cowan wrote: Well, the German translation also has one e in it -- Gib uns das tägliche Brot, and Perec apparently (the facts are not quite certain) told someone that there *was* a single e in the original -- he did not disclose its whereabouts. Well, somebody go to Gutenberg and run a search. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: GB 18030 question
I have an additional question about GB18030: the following code points in GB18030 are mapped to the Private Use Area in Unicode but have a glyph in the GB18030 standard. What does that mean? Page 11 of GB18030: 0xA6EC, 0xA6ED, 0xA6F3, 0xA6D9 - 0xA6DF. Page 81 of GB18030: 0xFE50 - 0xFEA0. Ref: http://bugzilla.mozilla.org/show_bug.cgi?id=125407 Qingjiang (Brian) Yuan wrote: Frank and Deborah, After I saw the e-mail from Deborah, I asked our Beijing office to contact the CESI. The following is the information we got: -- Have contacted CESI. It is really a glyph bug. They have fixed it, but they did not notify us! CESI will not give us the updated fonts until tomorrow morning. It was said that a series of glyphs has been updated in the new version of the bitmap fonts. -- Thanks. Brian. Yung-Fong Tang wrote: It looks like both Mac/Linux/Windows N6.2 and current Mozilla map that to FFE3. Looks like IE on WinXP does the same. We, the Mozilla i18n group, got the GB18030 mapping table from Sun. B Yuan, any comment? Michael Everson wrote: At 11:23 -0800 2002-02-01, Deborah Goldsmith wrote: There is an error on page 10 of the GB 18030-2000 standard, in that the character with code point A3FE maps to U+FFE3 (FULLWIDTH MACRON), but is shown with a glyph that corresponds to U+FF5E (FULLWIDTH TILDE). The position of the character in its code block would also seem to indicate that tilde was intended. Does anyone have any idea of which should be considered correct, the glyph or the Unicode mapping value? Glyphs are informative in JTC1. I can only assume that the GB standards would follow suit.
Re: GB 18030 question
Yung-Fong Tang wrote: I have additional question about GB18030 the following code point in GB18030 are map to Private Use Area in Unicode but have a glyph in the GB18030 standard. What does that mean ? It means those characters/symbols are not in Unicode 3.0. The following are the characters that are not in Unicode 3.0 according to the CESI (GB18030 code -> Unicode Private Use Area code point): A8BC -> E7C7; FE51 -> E816; FE52 -> E817; FE53 -> E818; FE59 -> E81E; FE61 -> E826; FE66 -> E82B; FE67 -> E82C; FE6C -> E831; FE6D -> E832; FE76 -> E83B; FE7E -> E843; FE90 -> E854; FE91 -> E855; FEA0 -> E864. But it looks like there are more symbols that are not in Unicode 3.0. Brian. page 11 of GB18030 0xA6EC 0xA6ED 0xA6F3 0xA6D9 - 0xA6DF page 81 of GB18030 0xFE50 - 0xFEA0 ref- http://bugzilla.mozilla.org/show_bug.cgi?id=125407 Qingjiang (Brian) Yuan wrote: Frank and Deborah, After I saw the e-mail from Deborah, I asked our Beijing office to contact the CESI. The following is the information we got: -- Have contacted CESI. It is really a glyph bug. They have fixed it, but they did not notify us! CESI will not give us the updated fonts until tomorrow morning. It was said that a series of glyphs has been updated in the new version of the bitmap fonts. -- Thanks. Brian. Yung-Fong Tang wrote: It looks like both Mac/Linux/Windows N6.2 and current Mozilla map that to FFE3. Looks like IE on WinXP does the same. We, the Mozilla i18n group, got the GB18030 mapping table from Sun. B Yuan, any comment? Michael Everson wrote: At 11:23 -0800 2002-02-01, Deborah Goldsmith wrote: There is an error on page 10 of the GB 18030-2000 standard, in that the character with code point A3FE maps to U+FFE3 (FULLWIDTH MACRON), but is shown with a glyph that corresponds to U+FF5E (FULLWIDTH TILDE). The position of the character in its code block would also seem to indicate that tilde was intended. Does anyone have any idea of which should be considered correct, the glyph or the Unicode mapping value? Glyphs are informative in JTC1. I can only assume that the GB standards would follow suit.
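For readers who want to check the pairs in the message above, here is a small Python sketch. The dictionary simply transcribes the fifteen GB18030/Unicode pairs as posted; the names `CESI_PUA_MAPPINGS` and `is_bmp_private_use` are hypothetical, and nothing here consults an actual GB 18030 mapping table.

```python
# GB 18030 two-byte code -> Unicode code point, as listed in the message.
CESI_PUA_MAPPINGS = {
    0xA8BC: 0xE7C7, 0xFE51: 0xE816, 0xFE52: 0xE817, 0xFE53: 0xE818,
    0xFE59: 0xE81E, 0xFE61: 0xE826, 0xFE66: 0xE82B, 0xFE67: 0xE82C,
    0xFE6C: 0xE831, 0xFE6D: 0xE832, 0xFE76: 0xE83B, 0xFE7E: 0xE843,
    0xFE90: 0xE854, 0xFE91: 0xE855, 0xFEA0: 0xE864,
}


def is_bmp_private_use(code_point: int) -> bool:
    """True for U+E000..U+F8FF, the Private Use Area of the BMP."""
    return 0xE000 <= code_point <= 0xF8FF


# Every listed target should fall in the Private Use Area.
all_private = all(is_bmp_private_use(u) for u in CESI_PUA_MAPPINGS.values())
```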
Re: Unicode and end users
At 09:22 AM 2/14/02 +, Martin Kochanski wrote: Are there, in fact, many circumstances in which it is necessary for an end user to create files that do *not* have a BOM at the beginning? In principle this is a requirement for data being labelled *external to the data* as being in either UTF-16BE or UTF-16LE (ditto for UTF-32). These formats *must not* have a BOM. However, it may be the case in practice that protocols in which documents are labelled that way, don't accept separately edited documents, so this may be moot. UTF-8 should *never* contain the BOM. A./
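Python's codecs happen to mirror the labelled/unlabelled distinction drawn above, which makes for a compact illustration. This sketch shows how one particular library behaves; it is not a statement of what the standard requires.

```python
import codecs

text = "A"
labelled_be = text.encode("utf-16-be")   # byte order fixed by the label, no BOM
unlabelled = text.encode("utf-16")       # Python prepends a BOM in native order

# Under an explicit UTF-16BE label, a leading FE FF is not consumed as a
# signature; it stays in the decoded text as U+FEFF.
decoded_with_label = (codecs.BOM_UTF16_BE + labelled_be).decode("utf-16-be")

# Under the unlabelled "utf-16" codec the same bytes lose their BOM.
decoded_without_label = (codecs.BOM_UTF16_BE + labelled_be).decode("utf-16")
```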
Re: Smiles, faces, etc
Falkor wrote: Like 'em or hate 'em, those :) are here to stay. ...and there's at Probably, although the more people from outside the computer-tech world join in, the smaller percentage of people will use these, like my mother-in-law... They are already encoded in Unicode, using two or more Unicode characters... using a colon and a closing parenthesis (I personally prefer the version with a dash nose) is all you need. There are a couple of real smileys too, but some modern emailers actually recognize the regular form and display an image. If you replace the multi-character form, then you will break old software without much benefit. markus PS: ... and at the end of the day, Unicode is a _text_ encoding standard ... :-)
Re: Smiles, faces, etc
Markus Scherer wrote: Falkor wrote: Like 'em or hate 'em, those :) are here to stay. ...and there's at Probably, although the more people from outside the computer-tech world join in, the smaller percentage of people will use these, like my mother-in-law... They are already encoded in Unicode, using two or more Unicode characters... using a colon and a closing parenthesis (I personally prefer the version with a dash nose) is all you need. Methinks «We know what you need» is a bit patronizing. There are a couple of real smileys too, but some modern emailers actually recognize the regular form the « regular »... the contrived way you mean. and display an image. for what of a character. PS: ... and at the end of the day, Unicode is a _text_ encoding standard ... :-) Yea, yea and this punctuation ;-) isn't text right ? Why ? Because there is no character ;-) ! Why ? Because people already have what they want ! And we know what they want. Patrick
Re: Smiles, faces, etc
Patrick Andries wrote: There are a couple of real smileys too, but some modern emailers actually recognize the regular form and display an image. for what of a character. I meant for want of a character. P. Andries
Re: Smiles, faces, etc
On Thu, Feb 14, 2002 at 08:56:25PM -0500, Patrick Andries wrote: They are already encoded in Unicode, using two or more Unicode characters... using a colon and a closing parenthesis (I personally prefer the version with a dash nose) is all you need. Methinks «We know what you need» is a bit patronizing. That doesn't mean it's not right. There's a lot of absurd solutions created by people with problems, and a lot of solutions to problems that don't exist. There are a couple of real smileys too, but some modern emailers actually recognize the regular form the « regular »... the contrived way you mean. The regular way; the most common way; the way people actually use. PS: ... and at the end of the day, Unicode is a _text_ encoding standard ... :-) Yea, yea and this punctuation ;-) isn't text right ? Why ? Because there is no character ;-) ! See the FAQ. There's no character MALTESE IE, or SPANISH LL, either, but they are still usable in plain text. Unless Unicode is willing to dedicate several hundred characters to these, there will be many smileys that will be unencoded. And unless Microsoft is willing to add it to their keyboards, most people won't be able to use it directly. So once most systems support it - in what, 4-5 years? - programs may autoreplace the smiley. So IM's will send 3 bytes across the net to replace three byte-sized ASCII characters, with the same net effect, but having successfully broken backward compatibility with anybody using older hardware or software. -- David Starner
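The byte count in the last paragraph is easy to verify. The sketch below assumes UTF-8 on the wire and uses U+263A WHITE SMILING FACE as a stand-in for a hypothetical encoded smiley; both are assumptions made for the sake of the arithmetic.

```python
ascii_smiley = ":-)"
encoded_smiley = "\u263A"   # WHITE SMILING FACE, used here as a stand-in

# Size of each form when sent as bytes.
ascii_size = len(ascii_smiley.encode("ascii"))
utf8_bytes = encoded_smiley.encode("utf-8")
utf8_size = len(utf8_bytes)
```

Under those assumptions the two sizes come out equal, which is the point being made: no bytes are saved by the replacement.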
Re: Smiles, faces, etc
On 2/14/02 8:34 PM, Markus Scherer [EMAIL PROTECTED] wrote: They are already encoded in Unicode, using two or more Unicode characters... using a colon and a closing parenthesis (I personally prefer the version with a dash nose) is all you need. The same could be said about dingbat arrows... Like dash-greaterthan and lessthan-dash-equal... Or superimposing a circumflex over a vertical bar. The impulse to ask about this came about by using multiple emailers, messaging systems, etc and having each interpret the faces and smiles (emoticons) differently. (not unlike a single hex code generating two different characters on different operating systems) The sequence colon-dash-X could be Kiss or Biting tongue, and Halo Angel has been seen as O-colon-closeparen and openparen-A-closeparen. [that is :-X O:) and (A) respectively] I was thinking more that this would allow modern software to translate a lower-ASCII three-character sequence into a single unicode emoticon character that would be displayed properly regardless of OS and software, also alleviating the need for such developers to create proprietary artwork for each. This multiple-keystroke-per-character input method does have precedent with Asian languages. If you replace the multi-character form, then you will break old software without much benefit. Can't make an omelette without breaking eggs. I'm sure Unicode as it is now wreaks havoc on DOS apps :) ...but point taken. PS: ... and at the end of the day, Unicode is a _text_ encoding standard ... :-) True enough. But sometimes text without inflection can be a dangerous thing. This is what emoticons can address. Besides, Dingbats and Miscellaneous Symbols aren't exactly textual. ...and if you can show me a document written with the Box Drawing block, I'd be impressed. :) With all due respect, --Harry
RE: Unicode and end users
Can you please expand on your statement that UTF-8 should never have a BOM? Having one makes it very easy to distinguish a text file that contains UTF-8 from one that contains text in the system default MBCS encoding. You may not be surprised to learn that Microsoft (or, at least, one of its programmers) does not agree with you. When I save a file from Notepad on Windows XP in UTF-8, the file contains a BOM. (I have no connection with Microsoft - I'm just a programmer who has to write code to import text files from time to time!) Thanks - rick cameron -----Original Message----- From: Asmus Freytag [mailto:[EMAIL PROTECTED]] Sent: Thursday, 14 February 2002 17:46 To: Martin Kochanski; [EMAIL PROTECTED] Subject: Re: Unicode and end users At 09:22 AM 2/14/02 +, Martin Kochanski wrote: Are there, in fact, many circumstances in which it is necessary for an end user to create files that do *not* have a BOM at the beginning? In principle this is a requirement for data being labelled *external to the data* as being in either UTF-16BE or UTF-16LE (ditto for UTF-32). These formats *must not* have a BOM. However, it may be the case in practice that protocols in which documents are labelled that way don't accept separately edited documents, so this may be moot. UTF-8 should *never* contain the BOM. A./
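The import-side check described in that message (using a leading BOM to tell a Unicode file from one in the system default encoding) can be sketched in Python as follows. The function names and the cp1252 fallback are illustrative assumptions, not anything Notepad or another program is known to do; the BOM constants come from the standard `codecs` module.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode by BOM when there is one, else with the assumed default."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        return data.decode(fallback)
    return data[skip:].decode(encoding)


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(decode_text(b"caf\xe9"))
```

A file without a BOM falls through to the fallback, which is exactly the case the thread is arguing about: the sketch cannot tell BOM-less UTF-8 from legacy text.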
Re: Smiles, faces, etc
On Thu, Feb 14, 2002 at 10:28:19PM -0500, Falkor wrote: Miscellaneous Symbols aren't exactly textual. ...and if you can show me a document written with the Box Drawing block, I'd be impressed. :) I don't have an example at hand, but if you dig up an old DOS shareware disk and poke through the README files, it won't take long to come up with one that uses the Box Drawing characters in CP437. -- David Starner / Давид Старнэр
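To make the CP437 point concrete, here is a small Python sketch that decodes a few bytes of the kind an old DOS README might contain. The byte string is made-up illustrative data, not taken from any actual file; the `cp437` codec and `unicodedata` are part of the standard library.

```python
import unicodedata

# Illustrative data: a tiny two-line frame as CP437 bytes.
raw = bytes([0xDA, 0xC4, 0xC4, 0xBF, 0x0D, 0x0A,
             0xC0, 0xC4, 0xC4, 0xD9])

text = raw.decode("cp437")
print(text)

# Each non-newline character lands in Unicode's Box Drawing block.
for ch in text:
    if ch not in "\r\n":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The decoded characters are ordinary Unicode text, which is the sense in which such READMEs count as documents written with the Box Drawing block.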
RE: Unicode and end users
UTF-8 should *never* contain the BOM. But as has been pointed out, it is common practice for Microsoft, and also for ICU's genrb tool, for example, which uses the BOM to autodetect the encoding. The more examples of that you see, the more people will use the BOM (now, can't we all use -*- coding: utf-8 -*- ;-)?). YA
Re: Smiles, faces, etc
On Thu, Feb 14, 2002 at 10:55:04PM -0500, Patrick Andries wrote: The regular way; the most common way; the way people actually use. Well, because there is no other way with a keyboard. But what do people do with a pencil ? What is the way people actually draw smileys then ? Tilted 90° ? People add these things to written text? I've never seen it, and it doesn't sound like you have, either. Unless Unicode is willing to dedicate several hundred characters to these, there will be many smileys that remain unencoded. Which is obviously an argument to encode none (or only those that are legacy). Now, granted the problem is to determine what is the set that could be encoded and here ISO/Unicode hasn't got its work cut out for itself : there is no prior approved set. I misstated myself; the problem is not that the number is large, it's that it's open-ended. "(-." is a valid smiley, as is ":-;". I admit that there is a practical limitation as far as inputting these characters is concerned, but then how many Unicode characters has Microsoft (?) added to its [US ?] keyboard. (Yeah, Microsoft. One heck of a keyboard, though a little fragile for my tastes. If I could just get one of the old steel keyboards with all the bucky bits in a split layout . . .) But I can enter LATIN CAPITAL LETTER HVAIR when I need it. People aren't going to pull up the character map when they need a smiley - they'll just type it in. So once most systems support it - in what, 4-5 years? - programs may autoreplace the smiley. They already do. I'm not really sure I understand you. Are you aware that I didn't need to use the «regular way» to get ☺ and :-) ? One out of two ain't bad, I guess. That was garbage on the screens of some of the subscribers, though - UTF-8 display is still not universal. The point, though, was that it will take a year, maybe more, to standardize the characters. It will take another couple of years for new systems to regularly provide fonts for them.
And it will take yet another couple of years for people to have regularly upgraded their OS to the newest system. Are we really obsessed about byte size ? The effect is not net : you would now have characters which can take different appearances (font variants if you want). They can then be straight up (normal instead of tilted), coloured or even animated. Huh? If you want that, you're going to have to transmit inline graphics. You can't animate glyphs in a font. You can color a current ASCII smiley with HTML as easily as you can any new smiley, and a color drawing of a face is just that, a color drawing, not text. I wonder sometimes if the largest obstacle in the encoding of smileys as characters is not the universal normalization process itself. The problem is, they are fundamentally ASCII text art, which appears only in computer systems, and only there as ASCII text art. There's no prior art to point to, except for systems that clearly display them as graphical objects, not text. -- David Starner / Давид Старнэр
Re: Smiles, faces, etc
David Starner wrote: People add these things to written text? I've never seen it, and it doesn't sound like you have, either. I wonder how you know this. I do write smileys on pieces of paper. Unless Unicode is willing to dedicate several hundred characters to these, there will be many smileys that remain unencoded. Which is obviously an argument to encode none (or only those that are "legacy"). Now, granted the problem is to determine what is the set that could be encoded and here ISO/Unicode hasn't got its work cut out for itself : there is no prior approved set. I misstated myself; the problem is not that the number is large, it's that it's open-ended. "(-." is a valid smiley, as is ":-;". Yes and so is the ideographic collection : it is open-ended. So once most systems support it - in what, 4-5 years? - programs may autoreplace the smiley. They already do. I'm not really sure I understand you. Are you aware that I didn't need to use the «regular way» to get ☺ and :-) ? One out of two ain't bad, I guess. That was garbage on the screens of some of the subscribers, though - UTF-8 display is still not universal. Oh, I see, no Unicode characters now...lest old hardware breaks down, right ? ;-) The point, though, was that it will take a year, maybe more, to standardize the characters. It will take another couple of years for new systems to regularly provide fonts for them. And it will take yet another couple of years for people to have regularly upgraded their OS to the newest system. This applies to any new character. Are we really obsessed about byte size ? The effect is not net : you would now have characters which can take different appearances (font variants if you want). They can then be straight up (normal instead of tilted), coloured or even animated. Huh? If you want that, What ? A straight up smiley ? A bold smiley ? A different design ?
you're going to have to transmit inline graphics. No, that can be left to the receiving end (stylesheet, font settings, etc.). Enough (for me). P. Andries
RE: Unicode and end users
Can you please expand on your statement that UTF-8 should never have a BOM? Having one makes it very easy to distinguish a text file that contains UTF-8 from one that contains text in the system default MBCS encoding. You may not be surprised to learn that Microsoft (or, at least, one of its programmers) does not agree with you. When I save a file from Notepad on Windows XP in UTF-8, the file contains a BOM. It seems there are quite a few answers to these questions in the Unicode utf-bom faq http://www.unicode.org/unicode/faq/utf_bom.html including mention of the Microsoft case and the fact that generally a BOM can be used with any UTF.
Re: Smiles, faces, etc
On Thu, Feb 14, 2002 at 11:48:04PM -0500, Patrick Andries wrote: People add these things to written text? I've never seen it, and it doesn't sound like you have, either. I wonder how you know this. I do write smileys on pieces of paper. I inferred that from your question about how people write them. I apologize if that was a mistaken inference. One out of two ain't bad, I guess. That was garbage on the screens of some of the subscribers, though - UTF-8 display is still not universal. Oh, I see, no Unicode characters now...lest old hardware breaks down, right ? ;-) If your goal is to communicate, then you pick your tools wisely. Gratuitous use of Unicode smileys with people who may not be running the latest system is not productive to communication. The point, though, was that it will take a year, maybe more, to standardize the characters. It will take another couple of years for new systems to regularly provide fonts for them. And it will take yet another couple of years for people to have regularly upgraded their OS to the newest system. This applies to any new character. True, and many people who might try to get a new character encoded think again, and look for another solution. A character that is part of many ancient classics is worth waiting to encode. An ephemeral character like most smileys just isn't. Huh? If you want that, What ? A straight up smiley ? A bold smiley ? A different design ? You have bold smileys. If you want animations, color. you're going to have to transmit inline graphics. No, that can be left to the receiving end (stylesheet, font settings, etc.). But modern systems don't have the capability to animate or color (in more than one color) characters. That's graphics. For a proposal, you'd need examples of the character being used in print, as a character and not a graphic. Do you have any examples? -- David Starner / Давид Старнэр
Re: Smiles, faces, etc
David Starner wrote: For a proposal, you'd need examples of the character being used in print, as a character and not a graphic. Do you have any examples? On tourne en rond (we are going round in circles), as we say in French. What is a character and not a graphic for you ? Some « thing » that is already encoded as a character ? A « thing » found among (inline) printed text ? A hand-written sign found mixed with other signs called letters or punctuation marks ? Excuse me if I do not go on with this thread. Patrick Andries