RE: This spoofing and security thread

2002-02-14 Thread Yves Arrouye

 The very fact that most of them can be reduced to ASCII and people still
 find the resulting text useful and accurate to the original is a sign
 that the important characters in English are in ASCII. And all the
 standard transliterations - em-dashes -> --, c-cedilla -> c, e-acute,
 e-grave -> e, o-umlaut -> o, shaped quotes -> " and ' - are from
 characters in Windows-1252.

Well, wouldn't you expect an American standard to properly encode the
important characters for English? I would. Only ISO has the luxury of
encoding Western European languages without catering properly to French and
some Nordic language (sorry, forgot which; as for French, I am referring to
the lack of oe ligature in iso-8859-1).

YA





Unicode and end users

2002-02-14 Thread Martin Kochanski

First, let me thank everyone for their wise and experienced comments. This is exactly 
what this sort of list should be for...

For the sake of clarity, let me define two terms:
1. Unicode means Unicode.
2. UNICODE means what an end user thinks when he sees the characters U, n, i, c, o, 
d, e on the screen, in that order.

What we are trying to establish is the exact meaning that UNICODE ought to have - that 
is, if it can have one at all.

I suggest that a more technical definition of UNICODE could be a file format that can 
be read by programs that read UNICODE. This is pretty certain to be what a user 
understands by the word!

Now in the world of application programs intended for real human beings (as opposed, 
for example, to specialised technical tools), I cannot see that any program will 
survive for long if it cannot read, without user intervention, files written in all 
the self-describing Unicode formats (all those with a BOM). It follows that any of 
these formats could, with equal propriety, be described as UNICODE.
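As a rough illustration of what reading "all the self-describing formats" might involve, here is a minimal Python sketch that sniffs the leading bytes for a BOM and decodes accordingly. The function names `sniff_bom` and `read_self_describing` are hypothetical, and the sketch assumes the whole file is already in memory.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def read_self_describing(data: bytes) -> str:
    """Decode BOM-prefixed bytes; refuse to guess when there is no BOM."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        raise ValueError("no BOM: the encoding must be supplied some other way")
    return data[skip:].decode(encoding)
```

One caveat of this ordering: the bytes FF FE 00 00 are treated as a UTF-32LE BOM, although they could in principle be a UTF-16LE BOM followed by U+0000.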

Moving back to output formats: this implies that the only requirement for a program 
that outputs data should be that if the user asks it to use UNICODE, the program uses 
one of the self-describing formats. The decision as to *which* of these formats to use 
would be up to the programmer. Depending on the circumstances, he may hard-wire a 
specific choice (perhaps whatever is best for the platform), or he may provide a 
configuration option accessible to more technical users.

Now, a question: 

Are there, in fact, many circumstances in which it is necessary for an end user to 
create files that do *not* have a BOM at the beginning?






Re: This spoofing and security thread

2002-02-14 Thread Martin Kochanski

At 23:43 13/02/02 -0600, David Starner wrote:
On Wed, Feb 13, 2002 at 08:46:31PM -0800, Yves Arrouye wrote:
  What do you mean? I've done works for Project Gutenberg, and looked at a
  number of books with thoughts of reducing them to ASCII. In my opinion,
  Windows-1252 has every character that most English books will need,
 
 Especially those books that you want to reduce to ASCII :-)

The very fact that most of them can be reduced to ASCII and people still
find the resulting text useful and accurate to the original is a sign
that the important characters in English are in ASCII.

And the fact that after reading those books a whole generation of English-speakers 
will go round Spain (or even the Californian school system) asking people ?cuantos 
anos tiene? and NOT get the answer they deserve shows, depending on your viewpoint, 
the patient forbearance of a noble race or the proper humility of a conquered people...





RE: This spoofing and security thread

2002-02-14 Thread Marco Cimarosti

Yves Arrouye wrote:
 Well, wouldn't you expect an American standard to properly encode the
 important characters for English? I would. Only ISO has the luxury of
 encoding Western Europe languages without catering properly 
 to French and some Nordic language (sorry, forgot which; as for
 French, I am referring to the lack of oe ligature in iso-8859-1).

Perhaps you are referring to the lack of letter š for Finnish. BTW, it also
lacks Ÿ for French. Thanks to euro, all this was fixed in ISO 8859-15:

A4 € EURO SIGN
A6 Š LATIN CAPITAL LETTER S WITH CARON
A8 š LATIN SMALL LETTER S WITH CARON
B4 Ž LATIN CAPITAL LETTER Z WITH CARON
B8 ž LATIN SMALL LETTER Z WITH CARON
BC Œ LATIN CAPITAL LIGATURE OE
BD œ LATIN SMALL LIGATURE OE
BE Ÿ LATIN CAPITAL LETTER Y WITH DIAERESIS
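The list above can be cross-checked mechanically. The following Python sketch compares the standard library's ISO 8859-1 and ISO 8859-15 codecs byte by byte; the function name `latin9_differences` is hypothetical.

```python
import unicodedata


def latin9_differences():
    """List the byte values that decode differently in ISO 8859-1 and ISO 8859-15."""
    rows = []
    for byte in range(0xA0, 0x100):
        raw = bytes([byte])
        latin1 = raw.decode("iso8859-1")
        latin9 = raw.decode("iso8859-15")
        if latin1 != latin9:
            # Keep the byte, the ISO 8859-15 character, and its Unicode name.
            rows.append((byte, latin9, unicodedata.name(latin9)))
    return rows
```

Each returned row holds the byte value, the character it maps to in ISO 8859-15, and that character's Unicode name, so the result can be printed in the same layout as the table.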

_ Marco




RE: Unicode and end users

2002-02-14 Thread Lars Kristan

Martin Kochanski wrote:
 Are there, in fact, many circumstances in which it is 
 necessary for an end user to create files that do *not* have 
 a BOM at the beginning?

AFAIK, UTF-8 files are NOT supposed to have a BOM in them.

Why is UTF-16 perceived as UNICODE? Well, we all know it's because UCS-2
used to be the ONLY implementation of Unicode. But there is another
important difference between UTF-16 and UTF-8. It is barely possible to
misinterpret UTF-16, because it uses shorts and not bytes. On the other
hand, UTF-8 and ASCII are in extreme cases identical.

Why not have BOM in UTF-8? Probably because of the applications that don't
really need to know that a file is in UTF-8, especially since it may be pure
ASCII in many cases (e.g. system configuration files). And if Unicode is THE
codeset to be used in the future, then at some point in time all files would
begin with a UTF-8 BOM. Quite unnecessary. Further problems arise when you
concat files or start reading in the middle.

To be honest, Unicode meaning UTF-16 and UTF-8 are fine with me. It's
what I am used to. For UNIX users UTF-8 is just like EUC or ISO-8859-x,
another codeset. The fact that it is universal does not mean it has to be
called Unicode, I think UTF-8 is just fine and equally (or more) useful. And
on UNIX, it is essential that the user is aware of the codeset that is being
used. I keep seeing files being used as examples. Think filesystems, file
names. File names would surely not start with a BOM, even if files would.
Suppose you have a script that will create some files, it is published on
the web, and you want to save it so you can run it. Now, it is up to you,
how to save it. If you use UTF-8 filenames, you do not want to save it as
some ISO, neither as just any Unicode, but precisely UTF-8. The shell will
execute the script and use byte sequences from the file to create filenames.

Now, an opposite example. You execute ls > ls.out, in a directory that has
some filenames (say, old files) in ISO and many others in UTF-8. What format
is the resulting file in? Well, since this is happening in the year 2016,
the editor will assume it's in UTF-8. We already agreed there are no BOMs
in files unless they are UTF-16, so the file must be UTF-8 just like
(almost) everything else is. Even if BOMs were used, should this
file have it? Anyway, some invalid sequences will be encountered by the
editor, but then hopefully it will simply display some replacement
characters (or ask if it can do so). Hopefully it will allow me to save the
file, with invalid sequences intact. Editing invalid sequences (or inserting
new ones) would be too much to ask, right?

What bothers me a little bit is that I would not be able to save such a file
as UTF-16 because of the invalid sequences in it. Why would I? Well, Windows
has more and more support for UTF-8, so maybe I don't really need to. I
still wish I had an option though.
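The behaviour wished for here (display replacement characters, save the invalid bytes back untouched, and hit a wall when converting to UTF-16) can be sketched with Python's built-in `replace` and `surrogateescape` error handlers. This is only one possible approach, not something the thread proposes; the helper names and the sample `listing` are hypothetical.

```python
def load_lossless(raw: bytes) -> str:
    """Decode UTF-8, carrying each undecodable byte through as a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def save_lossless(text: str) -> bytes:
    """Inverse of load_lossless: escaped bytes come back out unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


def for_display(raw: bytes) -> str:
    """What an editor might show: U+FFFD in place of each invalid byte."""
    return raw.decode("utf-8", errors="replace")


# A listing that mixes a UTF-8 file name with an ISO 8859-1 one
# (0xE9 is e-acute in ISO 8859-1, but an invalid sequence here).
listing = "caf\u00e9\n".encode("utf-8") + b"caf\xe9\n"
text = load_lossless(listing)
```

With this representation `save_lossless(text)` returns the original bytes, while a strict `text.encode("utf-16")` raises `UnicodeEncodeError`, which mirrors the "cannot save as UTF-16" concern.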

This again makes me think that UTF-8 and UTF-16 are not both Unicode. Maybe
UTF-16 is 'more' Unicode right now, because of the past. But maybe UTF-8
will be 'more' Unicode in the future, because it can contain invalid
sequences and these can be properly interpreted by someone at a later time.
Unless UTF-16 has that same ability, it will lose the battle of being an
'equally good Unicode format'.


And why do I keep this in the Unicode and end users thread? Because
invalid sequences (and old filenames) are a fact that users WILL experience
and pretending that this is just a case of non-conformance is not in the
best interest of the users.


Lars Kristan
Storage & Data Management Lab
HERMES SoftLab





Unicode, Oh Unicode: lyrics

2002-02-14 Thread John Cowan

I can't make out the lyrics through my crappy speakers.  Are they
on line anywhere?

-- 
John Cowan [EMAIL PROTECTED] http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_





Re: Off-Topic (Re: This spoofing and security thread)

2002-02-14 Thread Elliotte Rusty Harold

Patrick Andries scripsit:

  Quite a feat indeed : since e accounts for 13% of letters in a typical
  English text.

Indeed.  It's called Gadsby, and the author of La disparition
certainly knew it.


Interesting. It appears to be online at http://gadsby.hypermart.net/. 
Lots of nasty pop-up ads there though.
-- 

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|  The XML Bible, 2nd Edition (Hungry Minds, 2001)   |
|  http://www.ibiblio.org/xml/books/bible2/  |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+--+-+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/  |
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/ |
+--+-+




Re: Unicode, Oh Unicode: lyrics

2002-02-14 Thread Roozbeh Pournader

On Thu, 14 Feb 2002, John Cowan wrote:

 I can't make out the lyrics through my crappy speakers.  Are they
 on line anywhere?

That's it:

Oh beautiful for Uni-Han,
for spacious User Zone!
For rampant scripts of India
and polar Nunavut!

Unicode, Oh Unicode
May all your code points
shine forever
and your beacon light the world!

Oh, marvelous for sixteen bits,
for precious surrogates!
For Bi-Di algorithm dear
and stalwart I-P-A!

Unicode, Oh Unicode
May all your code points
shine forever
and your beacon light the world!

Oh, glorious for Hangul fair,
for symbols mathematical!
For myriad exotic scripts
and punctuation we adore!

Unicode, Oh Unicode
May all your code points
shine forever
and your beacon light the world!

BTW, I was just wondering if a new version will be prepared for 4.0...

roozbeh





Re: Off-Topic (Re: This spoofing and security thread)

2002-02-14 Thread Elliotte Rusty Harold

At 11:59 PM -0500 2/13/02, John Cowan wrote:
There is an English translation (or translation): The Void,
wherein the hero, Anton Voyl, becomes Anton Vowl.  There are German
and Danish translations too.


Do you happen to know if these translations also avoid the letter e? 
German's especially impressive since I think e makes up about 20% of 
the letters in typical German.




Re: Off-Topic (Re: This spoofing and security thread)

2002-02-14 Thread Patrick Andries



Elliotte Rusty Harold wrote:

 At 11:59 PM -0500 2/13/02, John Cowan wrote:

 There is an English translation (or translation): The Void,
 wherein the hero, Anton Voyl, becomes Anton Vowl.  There are German
 and Danish translations too.


 Do you happen to know if these translations also avoid the letter e? 
 German's especially impressive since I think e makes up about 20% of 
 the letters in typical German.

16,7 % http://www.santacruzpl.org/readyref/files/g-l/ltfrqger.shtml

17,5% for French according to 
http://www.santacruzpl.org/readyref/files/g-l/ltfrqfr.shtml

13,1% for English 
http://www.santacruzpl.org/readyref/files/g-l/ltfrqeng.shtml

13,7% for Spanish 
http://www.santacruzpl.org/readyref/files/g-l/ltfrqsp.shtml

P. Andries






RE: Off-Topic (Re: This spoofing and security thread)

2002-02-14 Thread Hohberger, Clive

If my memory is correct, James Thurber also wrote a short (American English)
book called The Wonderful O in which he did not use the letter e.
Clive

-Original Message-
From: John Cowan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 13, 2002 10:59 PM
To: Patrick Andries
Cc: Asmus Freytag; Juliusz Chroboczek; [EMAIL PROTECTED]
Subject: Re: Off-Topic (Re: This spoofing and security thread)


Patrick Andries scripsit:

 Quite a feat indeed : since e accounts for 13% of letters in a typical 
 English text.

Indeed.  It's called Gadsby, and the author of La disparition
certainly knew it.

 There is also one in French where e accounts for 15,3% of letters in a 
 typical text
 
 It's called La disparition (320 pages without an e), by Georges 
 Perec. Extract
http://www2.ec-lille.fr/~book/perec/textes/disparition.shtml

There is an English translation (or translation): The Void,
wherein the hero, Anton Voyl, becomes Anton Vowl.  There are German
and Danish translations too.

-- 
John Cowan   http://www.ccil.org/~cowan  [EMAIL PROTECTED]
To say that Bilbo's breath was taken away is no description at all.  There
are no words left to express his staggerment, since Men changed the language
that they learned of elves in the days when all the world was wonderful.
--_The Hobbit_




Re: Unicode and end users

2002-02-14 Thread Juliusz Chroboczek

MK What we are trying to establish is the exact meaning that UNICODE
MK ought to have - that is, if it can have one at all.

In the Unix-like world, the term ``UTF-8'' has been used quite
consistently, and most documentation avoids using Unicode for a disk
format (using it for the consortium, er., the Consortium, the
character repertoire and, when useful, for the coded character set).

The Unix-like public is used to thinking of UTF-8 as the format in
which Unicode text is saved on disk, and ``UTF-8 (Unicode)'' or
perhaps ``Unicode (UTF-8)'' should be the preferred user-interface
item.

MK Are there, in fact, many circumstances in which it is necessary
MK for an end user to create files that do *not* have a BOM at the
MK beginning?

You should never use either BOMs or UTF-16 on Unix-like systems; using
either will break too much of the system.

Juliusz




Re: Unicode and end users

2002-02-14 Thread Keld Jørn Simonsen

On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote:
 MK What we are trying to establish is the exact meaning that UNICODE
 MK ought to have - that is, if it can have one at all.
 
 In the Unix-like world, the term ``UTF-8'' has been used quite
 consistently, and most documentation avoids using Unicode for a disk
 format (using it for the consortium, er., the Consortium, the
 character repertoire and, when useful, for the coded character set).
 
 The Unix-like public is used to thinking of UTF-8 as the format in
 which Unicode text is saved on disk, and ``UTF-8 (Unicode)'' or
 perhaps ``Unicode (UTF-8)'' should be the preferred user-interface
 item.

I would rather recommend that you write ISO 10646 UTF-8 as the
ISO standard is a standard in many countries while Unicode is not.

Kind regards
keld




Re: This spoofing and security thread

2002-02-14 Thread Michael Everson

At 16:51 + 2002-02-14, Juliusz Chroboczek wrote:
   - a cross-reference of characters whose associated glyphs could be
   confused by a non-technical user;

ME Out of the entire standard? Who's going to do that for free? :-)

I don't know.  I'm not lobbying anyone here -- I'm just trying to
clarify why so many of us are finding it difficult to get to grips
with Unicode.

(Were you volunteering? ;-)

(Michael laughs out loud) Not for free. ;-)

Actually the annotations to the Unicode names list includes many such 
cross references.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Unicode and end users

2002-02-14 Thread Doug Ewell

Lars Kristan [EMAIL PROTECTED] wrote:

 AFAIK, UTF-8 files are NOT supposed to have a BOM in them.

Different operating systems and applications have different preferences.
There is no universal right or wrong about this.  This is
unfortunate, but true.

 Why is UTF-16 perceived as UNICODE? Well, we all know it's because UCS-2
 used to be the ONLY implementation of Unicode. But there is another
 important difference between UTF-16 and UTF-8. It is barely possible to
 misinterpret UTF-16, because it uses shorts and not bytes. On the other
 hand, UTF-8 and ASCII are in extreme cases identical.

At the risk of being mistaken for juuitchan by citing a Japanese
example:  A non-BOM file that starts with the bytes 0x30 0x42 could be
the UTF-8 characters 0B, or it could be the UTF-16BE character
HIRAGANA LETTER A.  (A similar situation applies for UTF-16LE.)  Now,
0B might not be the first two characters of many novels, but in a
techie Unix environment it could easily be the start of a text-format
data file.
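The ambiguity is easy to reproduce. A small Python sketch, using only standard codecs (the variable names are illustrative):

```python
import unicodedata

# The two bytes from the example, with no BOM in front of them.
data = bytes([0x30, 0x42])

as_utf8 = data.decode("utf-8")         # two ASCII characters
as_utf16be = data.decode("utf-16-be")  # one 16-bit code unit, 0x3042
utf16_name = unicodedata.name(as_utf16be)
```

Both decodes succeed without error, so nothing in the bytes themselves says which reading was intended.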

Two common heuristics for determining whether a file is UTF-16 are to
check whether every other byte is 0x00, or whether every other byte is
the same.  The former fails for non-Latin scripts, the latter fails
(less frequently) for scripts that are not part of a smallish alphabet.
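A minimal Python sketch of the two heuristics, to show where each one gives up. The function names are hypothetical, and real detectors would also want a minimum sample size, since very short inputs satisfy the second test trivially.

```python
def looks_like_utf16_zero_lane(data: bytes) -> bool:
    """Heuristic 1: every other byte is 0x00."""
    if len(data) < 2 or len(data) % 2:
        return False
    even, odd = data[0::2], data[1::2]
    return not any(even) or not any(odd)


def looks_like_utf16_constant_lane(data: bytes) -> bool:
    """Heuristic 2: every other byte has the same value."""
    if len(data) < 2 or len(data) % 2:
        return False
    even, odd = data[0::2], data[1::2]
    return len(set(even)) == 1 or len(set(odd)) == 1


latin = "hello".encode("utf-16-le")                                   # high bytes all 0x00
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-16-be")  # high bytes all 0x04
kanji = "\u65e5\u672c\u8a9e".encode("utf-16-be")                       # high bytes differ
```

On these samples the first test accepts only the Latin text, the second also accepts the Cyrillic text, and neither accepts the kanji text.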

That's the problem with no BOM:  you have to resort to heuristics, or
external tagging.

 Why not have BOM in UTF-8? Probably because of the applications that don't
 really need to know that a file is in UTF-8, especially since it may be pure
 ASCII in many cases (e.g. system configuration files). And if Unicode is THE
 codeset to be used in the future, then at some point in time all files would
 begin with a UTF-8 BOM. Quite unnecessary. Further problems arise when you
 concat files or start reading in the middle.

That's why U+2060 WORD JOINER is being introduced in Unicode 3.2.
Hopefully it will take over the ZWNBSP semantics from U+FEFF, which can
then be used *solely* as a BOM.  Eventually, if this happens, it will
become safe to strip BOM's as they appear.  (Of course, if you are
splitting or concatenating files, you should not do any interpretation
anyway.)

I have never seen a non-pathological example where stripping a file- or
stream-initial U+FEFF was harmful because of the possibility that it was
intended as ZWNBSP.  ZWNBSP (or WORD JOINER) affects the behavior of the
characters before and after it.  If there is no character before ZWNBSP,
it doesn't belong there.
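Stripping only a file-initial U+FEFF, as described above, might look like the following Python sketch. The helper name `strip_initial_bom` is hypothetical; the `utf-8-sig` codec is part of the standard library and does the same thing at the byte level for UTF-8 input.

```python
def strip_initial_bom(text: str) -> str:
    """Drop a single leading U+FEFF; leave any later occurrence alone."""
    return text[1:] if text.startswith("\ufeff") else text


# Byte-level equivalent for UTF-8: "utf-8-sig" skips a leading BOM if present
# and otherwise behaves like plain UTF-8.
with_bom = b"\xef\xbb\xbfabc".decode("utf-8-sig")
without_bom = b"abc".decode("utf-8-sig")
```

A U+FEFF in the middle of the text is deliberately left untouched, since there it may still be meant as ZWNBSP.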

 [O]n UNIX, it is essential that the user is aware of the codeset that is being
 used.

Unix users are accustomed to dealing with such details.

 Anyway, some invalid sequences will be encountered by the
 editor, but then hopefully it will simply display some replacement
 characters (or ask if it can do so). Hopefully it will allow me to save the
 file, with invalid sequences intact. Editing invalid sequences (or inserting
 new ones) would be too much to ask, right?

 What bothers me a little bit is that I would not be able to save such a file
 as UTF-16 because of the invalid sequences in it. Why would I? Well, Windows
 has more and more support for UTF-8, so maybe I don't really need to. I
 still wish I had an option though.

 This again makes me think that UTF-8 and UTF-16 are not both Unicode. Maybe
 UTF-16 is 'more' Unicode right now, because of the past. But maybe UTF-8
 will be 'more' Unicode in the future, because it can contain invalid
 sequences and these can be properly interpreted by someone at a later time.
 Unless UTF-16 has that same ability, it will lose the battle of being an
 'equally good Unicode format'.

I don't think the fact that invalid sequences are possible in UTF-8 and
not in UTF-16 makes UTF-8 inferior, or any less Unicode.  It was
designed that way.  Invalid sequences always represent a problem, just
like line noise.  They should not be treated as a normal situation.

-Doug Ewell
 Fullerton, California






Smiles, faces, etc

2002-02-14 Thread Falkor

This mailing list seems to be the first place for this, so...

There are two face characters in the Miscellaneous group.  Was wondering if
it would be appropriate to expand upon those two, possibly in its own block,
and add a series of smiles/faces/emoticons to the Unicode Standard.

Like 'em or hate 'em, those  :)  are here to stay.  ...and there's at
least twelve easily identifiable faces in common use on the internet.

Anyone have thoughts on this?

--Harry Davis





FW: This spoofing and security thread

2002-02-14 Thread jarkko . hietaniemi



-Original Message-
From: Hietaniemi Jarkko (NRC/Boston) 
Sent: Thursday, February 14, 2002 12:43
To: 'ext Marco Cimarosti'
Subject: RE: This spoofing and security thread


 Perhaps you are referring to the lack of letter š for Finnish. BTW, it also
 lacks Ÿ for French. Thanks to euro, all this was fixed in ISO 8859-15:

   A4 € EURO SIGN
   A6 Š LATIN CAPITAL LETTER S WITH CARON
   A8 š LATIN SMALL LETTER S WITH CARON
   B4 Ž LATIN CAPITAL LETTER Z WITH CARON
   B8 ž LATIN SMALL LETTER Z WITH CARON
   BC Œ LATIN CAPITAL LIGATURE OE
   BD œ LATIN SMALL LIGATURE OE
   BE Ÿ LATIN CAPITAL LETTER Y WITH DIAERESIS

Yup.  Strictly speaking, though, the caroned s and z are not needed for native
Finnish words, but they are needed for the proper spelling of a few Finnishized
loanwords like

šakki  chess
šekki  cheque

and for the proper spelling of Finnish transliteration of Cyrillic names.
(The traditional workaround for not having the letters has been to use sh and zh.)

I think the caron versions also make the Sámi people happier.




RE: This spoofing and security thread

2002-02-14 Thread jarkko . hietaniemi

:- a map from characters to languages.
: 
: This has been attempted for some sets of latin based languages. I don't
: have a link to one of the documents that do that. Main problem is that
: many *more* characters are actually used (and used quite commonly) by users
: of these languages, than acknowledged by the makers of these lists.

http://www.eki.ee/letter/ looks reasonably extensive.






Re: Unicode and end users

2002-02-14 Thread David Starner

On Thu, Feb 14, 2002 at 05:46:46PM +0100, Keld Jørn Simonsen wrote:
 I would rather recommend that you write ISO 10646 UTF-8 as the
 ISO standard is a standard in many countries while Unicode is not.

*Grumble*. The whole point of this discussion is making it clear for the
users. Unicode is more clear for more users than ISO 10646 is. There is
no reason to use ISO 10646, besides pedanticness. 

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




Re: Unicode and end users

2002-02-14 Thread Michael Everson

At 14:16 -0600 2002-02-14, David Starner wrote:

The whole point of this discussion is making it clear for the
users. Unicode is more clear for more users than ISO 10646 is. There is
no reason to use ISO 10646, besides pedanticness.

It is ISO/IEC 10646.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Unicode and end users

2002-02-14 Thread Michael \(michka\) Kaplan

From: Michael Everson [EMAIL PROTECTED]
 At 14:16 -0600 2002-02-14, David Starner wrote:

 There is no reason to use ISO 10646, besides pedanticness.

 It is ISO/IEC 10646.

The defense rests.

MichKa

Michael Kaplan
Trigeminal Software, Inc.  -- http://www.trigeminal.com/





Re: Off-Topic (Re: This spoofing and security thread)

2002-02-14 Thread Barry Caplan

This was discussed in a book I recently read, called Code (don't recall the 
author right now). Apparently the Danish (I think) translation has an 
error, but only one. I guess the proof reader was not familiar with grep :)

Barry


At 08:23 AM 2/14/2003 -0500, Elliotte Rusty Harold wrote:
At 11:59 PM -0500 2/13/02, John Cowan wrote:
There is an English translation (or translation): The Void,
wherein the hero, Anton Voyl, becomes Anton Vowl.  There are German
and Danish translations too.

Do you happen to know if these translations also avoid the letter e? 
German's especially impressive since I think e makes up about 20% of the 
letters in typical German.





Re: Off-Topic (Re: This spoofing and security thread)

2002-02-14 Thread Michael Everson

At 17:23 -0500 2002-02-14, John Cowan wrote:

Well, the German translation also has one e in it --
Gib uns das tägliche Brot, and Perec apparently (the facts are
not quite certain) told someone that there *was* a single e
in the original -- he did not disclose its whereabouts.

Well, somebody go to Gutenberg and run a search.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: GB 18030 question

2002-02-14 Thread Yung-Fong Tang




I have an additional question about GB18030.

The following code points in GB18030 are mapped to the Private Use Area in Unicode
but have a glyph in the GB18030 standard. What does that mean?

page 11 of GB18030
0xA6EC
0xA6ED
0xA6F3
0xA6D9 - 0xA6DF

page 81 of GB18030
0xFE50 - 0xFEA0

ref- http://bugzilla.mozilla.org/show_bug.cgi?id=125407


Qingjiang (Brian) Yuan wrote:

 Frank and Deborah,
 After I saw the e-mail from Deborah, I asked our Beijing office to
 contact the CESI. The following is the information we got:

 --
 Have contacted with CESI. It is really a glyph bug. They have fixed it,
 but they did not notify us!

 CESI will not give us the updated fonts until tomorrow morning. It was
 said that there are serial glyph have been updated in the new version of
 the bitmap fonts.
 --

 Thanks.
 Brian.

 Yung-Fong Tang Wrote:

 It looks like both Mac/Linux/Window N6.2 and current Mozilla map that to
 FFE3. Looks like IE on winXP do the same way.

 We, mozilla i18n group, got the GB18030 mapping table from sun. B Yuan,
 any comment?

 Michael Everson wrote:

 At 11:23 -0800 2002-02-01, Deborah Goldsmith wrote:

 There is an error on page 10 of the GB 18030-2000 standard, in that
 the character with code point A3FE maps to U+FFE3 (FULLWIDTH MACRON),
 but is shown with a glyph that corresponds to U+FF5E (FULLWIDTH
 TILDE). The position of the character in its code block would also
 seem to indicate that tilde was intended.

 Does anyone have any idea of which should be considered correct, the
 glyph or the Unicode mapping value?

 Glyphs are informative in JTC1. I can only assume that the GB
 standards would follow suit.









Re: GB 18030 question

2002-02-14 Thread Qingjiang (Brian) Yuan


Yung-Fong Tang wrote:

 I have an additional question about GB18030.
 
 The following code points in GB18030 are mapped to the Private Use Area in 
 Unicode but have a glyph in the GB18030 standard. What does that mean?
 

It means those characters/symbols are not in Unicode 3.0.
The following are the Characters that are not in Unicode 3.0 according
to the CESI:
 GB18030   Unicode (Private Use Area)
 A8BC      E7C7
 FE51      E816
 FE52      E817
 FE53      E818
 FE59      E81E
 FE61      E826
 FE66      E82B
 FE67      E82C
 FE6C      E831
 FE6D      E832
 FE76      E83B
 FE7E      E843
 FE90      E854
 FE91      E855
 FEA0      E864


But it looks like there are more symbols that are not in Unicode 3.0.

Brian.


 page 11 of GB18030
 0xA6EC
 0xA6ED
 0xA6F3
 0xA6D9 - 0xA6DF
 
 page 81 of GB18030
 
 0xFE50 - 0xFEA0
 
 ref- http://bugzilla.mozilla.org/show_bug.cgi?id=125407
 
 
 
 Qingjiang (Brian) Yuan wrote:
 
Frank and Deborah,
  After I saw the e-mail from Deborah, I asked our Beijing office to
contact the CESI. The following is the information we got:

--
Have contacted with CESI. It is really a glyph bug. They have fixed it,
but they did not notify us!

CESI will not give us the updated fonts until tomorrow morning. It was
said that there are serial glyph have been updated in the new version of
the bitmap fonts.
--

Thanks.
Brian.

Yung-Fong Tang Wrote:

It looks like both Mac/Linux/Window N6.2 and current Mozilla map that to
FFE3. Looks like IE on winXP do the same way.

We, mozilla i18n group, got the GB18030 mapping table from sun. B Yuan,
any comment?

Michael Everson wrote:

At 11:23 -0800 2002-02-01, Deborah Goldsmith wrote:

There is an error on page 10 of the GB 18030-2000 standard, in that
the character with code point A3FE maps to U+FFE3 (FULLWIDTH MACRON),
but is shown with a glyph that corresponds to U+FF5E (FULLWIDTH
TILDE). The position of the character in its code block would also
seem to indicate that tilde was intended.

Does anyone have any idea of which should be considered correct, the
glyph or the Unicode mapping value?


Glyphs are informative in JTC1. I can only assume that the GB
standards would follow suit.


 






Re: Unicode and end users

2002-02-14 Thread Asmus Freytag

At 09:22 AM 2/14/02 +, Martin Kochanski wrote:
Are there, in fact, many circumstances in which it is necessary for an end 
user to create files that do *not* have a BOM at the beginning?

In principle this is a requirement for data being labelled *external to the 
data* as being in either UTF-16BE or UTF-16LE (ditto for UTF-32). These 
formats *must not* have a BOM.

However, it may be the case in practice that protocols in which documents 
are labelled that way, don't accept separately edited documents, so this 
may be moot.

UTF-8 should *never* contain the BOM.
A./




Re: Smiles, faces, etc

2002-02-14 Thread Markus Scherer

Falkor wrote:

 Like 'em or hate 'em, those  :)  are here to stay.  ...and there's at


Probably, although the more people from outside the computer-tech world join in, the 
smaller percentage of people will use these, like my mother-in-law...

They are already encoded in Unicode, using two or more Unicode characters... using a 
colon and a closing parenthesis (I personally prefer the version with a dash nose) 
is all you need. There are a couple of real smileys too, but some modern emailers 
actually recognize the regular form and display an image.
If you replace the multi-character form, then you will break old software without much 
benefit.

markus


PS: ... and at the end of the day, Unicode is a _text_ encoding standard ... :-)





Re: Smiles, faces, etc

2002-02-14 Thread Patrick Andries



Markus Scherer wrote:

 Falkor wrote:

 Like 'em or hate 'em, those  :)  are here to stay.  ...and there's at



 Probably, although the more people from outside the computer-tech 
 world join in, the smaller percentage of people will use these, like 
 my mother-in-law...

 They are already encoded in Unicode, using two or more Unicode 
 characters... using a colon and a closing parenthesis (I personally 
 prefer the version with a dash nose) is all you need.


Methinks «We know what you need» is a bit patronizing.


 There are a couple of real smileys too, but some modern emailers 
 actually recognize the regular form

the « regular »... the contrived way you mean.

 and display an image. 

for what of a character.

 PS: ... and at the end of the day, Unicode is a _text_ encoding 
 standard ... :-) 


Yea, yea and this punctuation   ;-)  isn't text right ?  Why ? Because 
there is no character ;-)  ! Why ? Because people already have what they 
want ! And we know what they want.

Patrick






Re: Smiles, faces, etc

2002-02-14 Thread Patrick Andries



Patrick Andries wrote:


 There are a couple of real smileys too, but some modern emailers 
 actually recognize the regular form

 and display an image. 


 for what of a character.

I meant for want of a character.

P. Andries






Re: Smiles, faces, etc

2002-02-14 Thread David Starner

On Thu, Feb 14, 2002 at 08:56:25PM -0500, Patrick Andries wrote:
 They are already encoded in Unicode, using two or more Unicode 
 characters... using a colon and a closing parenthesis (I personally 
 prefer the version with a dash nose) is all you need.
 
 Methinks «We know what you need» is a bit patronizing.

That doesn't mean it's not right. There's a lot of absurd solutions
created by people with problems, and a lot of solutions to problems that
don't exist.
 
 There are a couple of real smileys too, but some modern emailers 
 actually recognize the regular form
 
 the « regular »... the contrived way you mean.

The regular way; the most common way; the way people actually use. 

 PS: ... and at the end of the day, Unicode is a _text_ encoding 
 standard ... :-) 
 
 Yea, yea and this punctuation   ;-)  isn't text right ?  Why ? Because 
 there is no character ;-)  ! 

See the FAQ. There's no character MALTESE IE, or SPANISH LL, either, but
they are still usable in plain text. 

Unless Unicode is willing to dedicate several hundred characters to
these, there will be many smilies that will be unencoded. And unless
Microsoft is willing to add it to their keyboards, most people won't be
able to use it directly. So once most systems support it - in what, 4-5
years? - programs may autoreplace the smilie. So IMs will send 3 bytes
across the net to replace three byte-sized ASCII characters, with the
same net effect, but having successfully broken backward compatibility
with anybody using older hardware or software.
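
A rough illustration of the byte count above, not part of the original message. It assumes Python 3 and assumes the replacement character would be U+263A WHITE SMILING FACE (the message does not name one); the variable names are made up for the example.

```python
# Compare the encoded size of an ASCII emoticon with a single Unicode smiley.
ascii_smiley = ":-)"
unicode_smiley = "\u263a"  # WHITE SMILING FACE (assumed replacement character)

ascii_bytes = ascii_smiley.encode("ascii")
utf8_bytes = unicode_smiley.encode("utf-8")

# What software limited to ASCII could make of the new character:
# it cannot be represented, so a substitute ("?") is all that survives.
legacy_view = unicode_smiley.encode("ascii", errors="replace")

print(len(ascii_bytes), ascii_bytes)
print(len(utf8_bytes), utf8_bytes)
print(legacy_view)
```

Both forms occupy three bytes in UTF-8, so nothing is saved on the wire, while an ASCII-only receiver loses the smiley entirely.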

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




Re: Smiles, faces, etc

2002-02-14 Thread Falkor

On 2/14/02 8:34 PM, Markus Scherer [EMAIL PROTECTED] wrote:

 They are already encoded in Unicode, using two or more Unicode characters...
 using a colon and a closing parenthesis (I personally prefer the version with
 a dash nose) is all you need.

The same could be said about dingbat arrows...  Like dash-greaterthan and
lessthan-dash-equal... Or superimposing a circumflex over a vertical bar.

The impulse to ask about this came about by using multiple emailers,
messaging systems, etc and having each interpret the faces and smiles
(emoticons) differently.  (not unlike a single hex code generating two
different characters on different operating systems) The sequence
colon-dash-X could be Kiss or Biting tongue, and Halo Angel has been
seen as O-colon-closeparen and openparen-A-closeparen.  [that is  :-X  O:)
and (A) respectively]

I was thinking more that this would allow modern software to translate a
lower-ASCII three-character sequence into a single unicode emoticon
character that would be displayed properly regardless of OS and software,
also alleviating the need for such developers to create proprietary artwork
for each.  This multiple-keystroke-per-character input method does have
precedent with Asian languages.
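
A minimal sketch of the translation step described above, not part of the original message. It assumes Python 3; the table `EMOTICONS` and the function `replace_emoticons` are hypothetical, and the mapping uses only the two face characters that already exist in Miscellaneous Symbols (U+263A and U+2639).

```python
import re

# Hypothetical mapping from ASCII sequences to existing Unicode characters.
EMOTICONS = {
    ":-)": "\u263a",  # WHITE SMILING FACE
    ":)": "\u263a",
    ":-(": "\u2639",  # WHITE FROWNING FACE
    ":(": "\u2639",
}

# Try longer sequences first so ":-)" is matched whole.
_PATTERN = re.compile(
    "|".join(re.escape(seq) for seq in sorted(EMOTICONS, key=len, reverse=True))
)


def replace_emoticons(text):
    """Replace every known ASCII emoticon in text with its mapped character."""
    return _PATTERN.sub(lambda match: EMOTICONS[match.group(0)], text)


print(replace_emoticons("point taken :) but :-X stays as typed"))
```

Sequences such as colon-dash-X are left alone here: with two competing readings (Kiss or Biting tongue) a lookup table has no way to choose one.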

 If you replace the multi-character form, then you will break old software
 without much benefit.

Can't make an omelette without breaking eggs.  I'm sure Unicode as it is now
wreaks havoc on DOS apps  :)  ...but point taken.

 PS: ... and at the end of the day, Unicode is a _text_ encoding standard ...
 :-)

True enough.  But sometimes text without inflection can be a dangerous
thing.  This is what emoticons can address.  Besides, Dingbats and
Miscellaneous Symbols aren't exactly textual.  ...and if you can show me a
document written with the Box Drawing block, I'd be impressed.  :)

With all due respect,
--Harry






RE: Unicode and end users

2002-02-14 Thread Rick Cameron

Can you please expand on your statement that UTF-8 should never have a BOM?
Having one makes it very easy to distinguish a text file that contains UTF-8
from one that contains text in the system default MBCS encoding.
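
A sketch of the kind of check meant here, not part of the original message. It assumes Python 3 and assumes, purely for illustration, that the system default encoding is Windows-1252; `guess_encoding` and `read_text` are hypothetical helper names.

```python
import codecs


def guess_encoding(data, default="cp1252"):
    """Return 'utf-8-sig' if data starts with the UTF-8 BOM, else the default."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return "utf-8-sig"  # this codec strips the BOM while decoding
    return default


def read_text(data, default="cp1252"):
    """Decode data using the encoding suggested by its first bytes."""
    return data.decode(guess_encoding(data, default))


with_bom = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
legacy = "caf\u00e9".encode("cp1252")

print(guess_encoding(with_bom), read_text(with_bom))
print(guess_encoding(legacy), read_text(legacy))
```

A UTF-8 file without a BOM falls through to the default in this sketch, which is exactly the ambiguity the question is about.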

You may not be surprised to learn that Microsoft (or, at least, one of its
programmers) does not agree with you. When I save a file from Notepad on
Windows XP in UTF-8, the file contains a BOM.

(I have no connection with Microsoft - I'm just a programmer who has to
write code to import text files from time to time!)

Thanks

- rick cameron

-Original Message-
From: Asmus Freytag [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, 14 February 2002 17:46
To: Martin Kochanski; [EMAIL PROTECTED]
Subject: Re: Unicode and end users


At 09:22 AM 2/14/02 +, Martin Kochanski wrote:
Are there, in fact, many circumstances in which it is necessary for an 
end
user to create files that do *not* have a BOM at the beginning?

In principle this is a requirement for data being labelled *external to the 
data* as being in either UTF-16BE or UTF-16LE (ditto for UTF-32). These 
formats *must not* have a BOM.
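
To illustrate the distinction (an aside, not part of the original message), here is a short sketch assuming Python 3, whose codec names happen to follow the same split: `utf-16-be` and `utf-16-le` write no BOM, while plain `utf-16` writes one.

```python
import codecs

text = "A"

# Byte order stated by the label: no BOM is written.
labelled_be = text.encode("utf-16-be")
labelled_le = text.encode("utf-16-le")

# Byte order not stated: the encoder prepends a BOM in the platform's order.
unlabelled = text.encode("utf-16")

# If a BOM is sent under an explicit BE/LE label anyway, the decoder keeps
# it as an ordinary U+FEFF character at the start of the text.
stray = (codecs.BOM_UTF16_BE + labelled_be).decode("utf-16-be")

print(labelled_be, labelled_le, unlabelled, repr(stray))
```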

However, it may be the case in practice that protocols in which documents 
are labelled that way, don't accept separately edited documents, so this 
may be moot.

UTF-8 should *never* contain the BOM.
A./




Re: Smiles, faces, etc

2002-02-14 Thread David Starner

On Thu, Feb 14, 2002 at 10:28:19PM -0500, Falkor wrote:
 Miscellaneous Symbols aren't exactly textual.  ...and if you can show me a
 document written with the Box Drawing block, I'd be impressed.  :)

I don't have an example at hand, but if you dig up an old DOS shareware
disk and poke through the README files, it won't take that long to come
up with that one that used the Box Drawing characters in CP437. 
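
For readers without such a disk, a small sketch (not part of the original message, assuming Python 3) of what those README bytes become in Unicode. The byte string `raw` is made up for the example.

```python
# A made-up three-line frame as it might appear in a CP437 README file.
raw = (
    b"\xc9\xcd\xcd\xcd\xcd\xbb\n"
    b"\xba OK \xba\n"
    b"\xc8\xcd\xcd\xcd\xcd\xbc"
)

# Decoding as CP437 maps each frame byte into the Box Drawing block.
box = raw.decode("cp437")
print(box)

# List the code points of the top edge.
top_edge = box.splitlines()[0]
print([f"U+{ord(ch):04X}" for ch in top_edge])
```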

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




RE: Unicode and end users

2002-02-14 Thread Yves Arrouye

 UTF-8 should *never* contain the BOM.

But as has been pointed out, it is common practice for Microsoft, and also for
ICU's genrb tool, for example, which uses the BOM to autodetect the
encoding. The more examples you see of that, the more people will use the
BOM (now, can't we all use -*- coding: utf-8 -*- ;-)?).
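
A sketch of how the in-band declaration could be read, not part of the original message. It assumes Python 3 and assumes the declaration sits in one of the first two lines; `declared_encoding` is a hypothetical helper and the pattern is a simplification, not any tool's actual rule.

```python
import re

# Matches "coding: name" or "coding=name", as in "-*- coding: utf-8 -*-".
_COOKIE = re.compile(rb"coding[:=][ \t]*([-\w.]+)")


def declared_encoding(data):
    """Return the encoding named in the first two lines of data, or None."""
    for line in data.splitlines()[:2]:
        match = _COOKIE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None


print(declared_encoding(b"# -*- coding: utf-8 -*-\nsome text\n"))
print(declared_encoding(b"no declaration here\n"))
```

Unlike a BOM, the declaration is plain ASCII text, so it only works for encodings that are ASCII-compatible in those first lines.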

YA





Re: Smiles, faces, etc

2002-02-14 Thread David Starner

On Thu, Feb 14, 2002 at 10:55:04PM -0500, Patrick Andries wrote:
 The regular way; the most common way; the way people actually use. 
 
 Well, because there is no other way with a keyboard. But what do people 
 do with a pencil ? What is the way people actually draw smileys then ? 
 Tilted 90° ?

People add these things to written text? I've never seen it, and it
doesn't sound like you have, either.

 Unless Unicode is willing to dedicate several hundred characters to
 these, there will be many smilies that will be unencoded.
 
 Which is obviously an argument to encode none (or only those that are 
 legacy). Now, granted the problem is to determine what is the set that 
 could be encoded and here ISO/Unicode hasn't got its work cut out for 
 itself : there is no prior approved set.

I misstated myself; the problem is not that the number is large, but that
it's open-ended. (-. is a valid smiley, as is :-;.

 I admit that there is a practical limitation as far as inputting these 
 characters is concerned, but then how many Unicode characters has 
 Microsoft (?) added to its [US ?] keyboard.

(Yeah, Microsoft. One heck of a keyboard, though a little fragile for my
tastes. If I could just get one of the old steel keyboards with all the
bucky bits in a split layout . . .)

But I can enter LATIN CAPITAL LETTER HWAIR when I need it. People aren't
going to pull up the character map when they need a smiley - they'll
just type it in.
 
 So once most systems support it - in what, 4-5 years? - programs may 
 autoreplace the smilie.
 
 They already do. I'm not really sure I understand you. Are you aware 
 that I didn't need to use the «regular way» to get  ☺ and :-)  ?

One out of two ain't bad, I guess. That was garbage on the screens of
some of the subscribers, though - UTF-8 display is still not universal.
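
What that garbage probably looked like can be reproduced with a short sketch (not part of the original message). It assumes Python 3 and assumes the receiving mail reader treated the bytes as Windows-1252; other legacy code pages would give different junk.

```python
# The sender's text: one Unicode smiley and one ASCII smiley.
message = "\u263a and :-)"

# Bytes on the wire when the mail is sent as UTF-8.
wire = message.encode("utf-8")

# A reader that assumes Windows-1252 turns the three smiley bytes into
# three unrelated characters; the ASCII part comes through untouched.
misread = wire.decode("cp1252")

print(wire)
print(misread)
```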

The point, though, was that it will take a year, maybe more, to
standardize the characters. It will take another couple years for new
systems to regularly provide fonts for them. And it will take yet
another couple years for people to have regularly upgraded their OS to
the newest system.

 Are we really obsessed about byte size ? The effect is not net : you 
 would now have characters which can take different appearances (font 
 variants if you want). They can  then be straight up (normal instead of 
 tilted), coloured or even animated.

Huh? If you want that, you're going to have to transmit inline graphics.
You can't animate glyphs in a font. You can color a current ASCII smiley
with HTML as easily as you can any new smiley, and a color drawing of a
face is just that, a color drawing, not text. 
 
 I wonder sometimes if the largest obstacle in the encoding of smileys as 
 characters is not the universal  normalization process itself.

The problem is, they are fundamentally ASCII text art, that appear only
in computer systems, and only there as ASCII text art. There's no prior
art to point to, except for systems that clearly display them as
graphical objects, not text.

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




Re: Smiles, faces, etc

2002-02-14 Thread Patrick Andries



David Starner wrote:
 People add these things to written text? I've never seen it, and it
 doesn't sound like you have, either.

I wonder how you know this. I do write smileys on pieces of paper.

 Unless Unicode is willing to dedicate several hundred characters to
 these, there will be many smilies that will be unencoded.

 Which is obviously an argument to encode none (or only those that are
 "legacy"). Now, granted the problem is to determine what is the set that
 could be encoded and here ISO/Unicode hasn't got its work cut out for
 itself : there is no prior approved set.

 I misstated myself; the problem is not that the number is large, but that
 it's open-ended. "(-." is a valid smiley, as is ":-;".
  
Yes and so is the ideographic collection : it is open-ended.
  
 So once most systems support it - in what, 4-5 years? - programs may
 autoreplace the smilie.

 They already do. I'm not really sure I understand you. Are you aware
 that I didn't need to use the «regular way» to get  ☺ and :-)  ?

 One out of two ain't bad, I guess. That was garbage on the screens of
 some of the subscribers, though - UTF-8 display is still not universal.

Oh, I see, no Unicode characters now...lest old hardware breaks down,
right ?  ;-)

 The point, though, was that it will take a year, maybe more, to
 standardize the characters. It will take another couple years for new
 systems to regularly provide fonts for them. And it will take yet
 another couple years for people to have regularly upgraded their OS to
 the newest system.
  
This applies to any new character.
  
 Are we really obsessed about byte size ? The effect is not net : you
 would now have characters which can take different appearances (font
 variants if you want). They can  then be straight up (normal instead of
 tilted), coloured or even animated.

 Huh? If you want that,
  
What ? A straight up smiley ? A bold smiley ? A different design ?

 you're going to have to transmit inline graphics.

No, that can be left to the receiving end (stylesheet, font settings, etc.).

Enough (for me).

P. Andries





RE: Unicode and end users

2002-02-14 Thread Tom Gewecke

Can you please expand on your statement that UTF-8 should never have a BOM?
Having one makes it very easy to distinguish a text file that contains UTF-8
from one that contains text in the system default MBCS encoding.

You may not be surprised to learn that Microsoft (or, at least, one of its
programmers) does not agree with you. When I save a file from Notepad on
Windows XP in UTF-8, the file contains a BOM.

It seems there are quite a few answers to these questions in the Unicode
utf-bom faq

http://www.unicode.org/unicode/faq/utf_bom.html

including mention of the Microsoft case and the fact that generally a BOM
can be used with any UTF.






Re: Smiles, faces, etc

2002-02-14 Thread David Starner

On Thu, Feb 14, 2002 at 11:48:04PM -0500, Patrick Andries wrote:
 People add these things to written text? I've never seen it, and it
 doesn't sound like you have, either.

 I wonder how you know this. I do write smileys on piece of papers.

I inferred that from your question about how people write them. I
apologize if that was a mistaken inference.

 One out of two ain't bad, I guess. That was garbage on the screens of
 some of the subscribers, though - UTF-8 display is still not universal.
 
 Oh, I see, no Unicode characters now...lest old hardware breaks down, 
 right ? ;-)

If your goal is to communicate, then you pick your tools wisely. Gratuitous
use of Unicode smileys with people who may not be running the latest
system is not productive to communication. 

 The point, though, was that it will take a year, maybe more, to
 standardize the characters. It will take another couple years for new
 systems to regularly provide fonts for them. And it will take yet
 another couple years for people to have regularly upgraded their OS to
 the newest system.

 This applies to any new character.

True, and many people who might try to get a new character encoded think
again, and look for another solution. A character that is part of many
ancient classics is worth waiting to encode. An ephemeral character
like most smileys just isn't.
 
 Huh? If you want that,
 
 What ? A straight up smiley ? A bold smiley ? A different design ?

You have bold smileys. If you want animations, color.
 
 you're going to have to transmit inline graphics.
 
 No, that can be left to the receiving end (stylesheet, font settings, etc.).

But modern systems don't have the capability to animate or color (in more
than one color) characters. That's graphics. 

For a proposal, you'd need examples of the character being used in
print, as a character and not a graphic. Do you have any examples?

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




Re: Smiles, faces, etc

2002-02-14 Thread Patrick Andries



David Starner wrote:



For a proposal, you'd need examples of the character being used in
print, as a character and not a graphic. Do you have any examples?

On tourne en rond, as we say in French. What is a character and not a 
graphic for you ? Some « thing » that is already encoded as a character 
? A « thing » found among  (inline) printed text  ? A hand-written sign 
found mixed with other signs called letters or punctuation marks ?

Excuse me, if I do not go on with this thread.

Patrick Andries