Re: Unicode and Security

2002-02-10 Thread DougEwell2

In a message dated 2002-02-10 13:00:19 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> However, I do continue to maintain that character confusion is a real 
> security risk that will have real impact on users, and that needs to 
> be considered in any system that uses Unicode.

We have already established that similar-looking characters can cause 
confusion in Unicode-based systems.  However, we have also established that 
ISO 8859-1, 8859-5 (Cyrillic), 8859-7 (Greek), and even ASCII can suffer from 
this same problem.  It is unrealistic to suggest that the problem began with 
Unicode.

> In some domains the 
> problem might be severe enough to eliminate Unicode from 
> consideration in favor of less extensive character sets like Latin-1. 
> That would be a shame, but until the Unicode consortium addresses at 
> a root level the real security implications of their work, security 
> conscious developers will look elsewhere. (I notice the Unicode 3.0 
> book does not even have the word "security" in its index.) Many more 
> developers who are at best tangentially conscious of security issues 
> will go ahead and develop insecure systems because they don't realize 
> the security implications of adopting Unicode.

Companies and individuals that choose to throw out the baby with the bath 
water will achieve the kind of results that that approach usually delivers.

Companies and individuals that wish to establish their own definitions of, 
and policies for dealing with, confusable characters are free to do so.  As I 
stated earlier, and nobody could refute, there is no consistent way to 
determine which sets of characters are confusable with each other, other than 
in the most obvious cases like o/omicron.  So of course neither the Unicode 
Consortium nor WG2 has taken it upon themselves to draw up such a list.  This 
must be a local decision.
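If an organization does choose to define such a local policy, the "most obvious cases" can be handled with a simple folding table before comparison. A minimal Python sketch; the table here is deliberately tiny and entirely a hypothetical local decision, not any standard list:

```python
# A hypothetical local confusable-folding table covering only the most
# obvious Greek/Cyrillic look-alikes; any real policy would differ.
CONFUSABLE_FOLD = {
    '\u03BF': 'o',   # GREEK SMALL LETTER OMICRON -> Latin o
    '\u043E': 'o',   # CYRILLIC SMALL LETTER O    -> Latin o
    '\u0430': 'a',   # CYRILLIC SMALL LETTER A    -> Latin a
    '\u0435': 'e',   # CYRILLIC SMALL LETTER IE   -> Latin e
}

def fold_confusables(s):
    """Map characters this local policy deems confusable to a canonical form."""
    return ''.join(CONFUSABLE_FOLD.get(c, c) for c in s)

# Two visually identical strings compare equal after folding:
assert fold_confusables('micr\u043Esoft') == fold_confusables('microsoft')
```

The point of keeping the table local is exactly the one made above: there is no consistent global answer, so each deployment decides how far to go.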

> Another possibility is a super-normalization that does combine 
> similar looking Unicode characters; e.g. in the domain name system we 
> might decide that microsoft.com with Latin o's or Cyrillic o's or 
> Greek o's is to resolve to the same address. No separate registration 
> would be necessary or possible. This would require detailed analysis 
> of the tens of thousands of Unicode characters allowed in domain 
> names by fluent speakers of various languages; not easy, not cheap, 
> but perhaps necessary. Besides the security improvements, this 
> proposal would also improve the system's usability. Not sure 
> whether that URL on the bus used an o or an omicron? Doesn't matter, 
> type either one.

Adding this sort of unification to the nameprep stage might have been 
possible about a year or so ago.  It's probably too late now.
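And indeed no such unification happened: the three o's still produce three distinct registrable names. A sketch using Python's built-in `idna` codec (an IDNA 2003 implementation, including nameprep, which post-dates this discussion; it is used here purely as an illustration):

```python
# Encode "micro soft" labels built with Latin o, Greek omicron, and
# Cyrillic o.  Nameprep does not unify them, so three distinct
# ASCII-compatible names result.
labels = {('micr' + o + 'soft').encode('idna')
          for o in ('o', '\u03BF', '\u043E')}   # Latin, Greek, Cyrillic
assert len(labels) == 3   # no unification: three separate DNS names
```

The non-ASCII variants come out with the `xn--` ACE prefix, while the pure-Latin name passes through unchanged, so all three can be registered independently.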

> Actually, people have been talking about the security problems with 
> HTML for years. Search engines have gone to some effort to eliminate 
> spamdexers that use these techniques. The log in HTML's eye does not, 
> however, negate the existence of the log in Unicode's eye.

Again (and again), the problem is not unique to Unicode.  Existing character 
sets also contain confusables.  Blaming Unicode for exacerbating the problem 
by offering so many characters is like blaming your local ice cream shop for 
offering 31 flavors, because that makes it so much more difficult to choose.

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




Re: IPA keyboard

2002-02-09 Thread DougEwell2

I apologize in advance for replying in public to Michka's private message, 
but he asked a good question.

>> What I don't want:
>> - Anything that requires me to install Keyman
>
> How come? (Just curious).

I really don't want to install a Big Pre-Packaged Solution and have it do 
everything for me, Windows-wide.  What I'm looking for is a technical spec 
that I can use to build something from scratch for one application.

I'd rather not go through the effort of installing Keyman, especially on the 
memory- and disk-challenged machine I'm using here at home, just so I can 
load a single keyboard layout, reverse-engineer it, and uninstall Keyman.  
That's all I'd really be doing with it, at least for now.

I know there are a lot of Keyman devotees on the list, so if I am greatly 
exaggerating the effort vs. payoff, please let me know in a gentle, 
flame-free way.  Also, if nobody is able to come up with a text description 
of the type I wanted, I may have to resort to the Keyman approach anyway.

James Kass wrote:

> Here is an actual layout for IPA UTF-8 entry:
> http://www.elgin.free-online.co.uk/ipa_kb_det.htm

This is weird.  Quoting from the page, "Since Unicode UTF-8 encoding codes 
each IPA symbol as two characters (bytes), you will have to type two keys for 
each letter."  Eventually it is revealed that the two-character sequences the 
user must type are "based on the SAMPA guidelines for typing IPA using 
ASCII."  No, this one isn't what I want.

> This page has graphics showing the Mac-IPA layout:
> http://www.matchfonts.com/pages/m-ipa.html

I had seen this page before.  This is more like what I had in mind, but it 
requires five keyboard states.  I know that full IPA support may require more 
than 47 * 4 = 188 characters, but I really have to stick to this limit.  I 
could tolerate supporting only the 188 "most common" IPA characters (whatever 
that means) if necessary.

Also, the Mac-IPA layout is presented as a bitmap only, without Unicode code 
points or even character names.  I'm not familiar enough with IPA to be able 
to distinguish, say, U+0279 from U+027A by looking at smallish bitmaps.

But I do appreciate James's effort in looking up these two resources and 
letting me know about them.

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




Re: Unicode and Security

2002-02-09 Thread DougEwell2

In a message dated 2002-02-09 13:00:59 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> It seems to me that this problem really needs some other fix than the
> merging of all similar-looking characters in all character sets. I
> just can't see that working. 

Even the "merging" part wouldn't work.  Let's say that I, like Ken Sakamura 
or Bernard Miller before me, have decided that I know much more about 
character encoding than the Unicode Consortium or WG2, and I am going to 
develop my own character encoding that will solve the problem of confusables 
once and for all.

OK, we start with the easy ones.  Latin A, Greek Alpha, and Cyrillic A all 
get unified.  Latin E, Greek Epsilon, Cyrillic E, unified.  Hey, this is 
easier than I thought.  Latin B, Greek Beta, Cyrillic Ve.  Ha!  I'm smart 
enough to know that Ve gets unified with B and Beta, even though it 
represents a different sound.  Just like Han unification!  Boy, those Unicode 
dolts really missed something there.

Let's keep going.  Latin Y, Greek Upsilon, Cyrillic U.  Wait a minute, that 
Cyrillic U doesn't look *quite* the same.  Oh well, it's close enough, right? 
 Let's try some lower-case letters.  Latin a, Greek alpha, Cyrillic a.  That 
Greek alpha looks kinda cursive, doesn't it?  Should we unify it or not?  
Hmmm...

How about Latin n and Greek eta?  Is that descender on the eta significant or 
not?  Hey, you could stick an eta in the middle of a Web address and really 
fool somebody.  Better unify.  How about Latin v and Greek nu?  Different 
glyphs or not?  In 9-point MS Sans Serif, they're pretty close, aren't they?  
(And don't forget Armenian vo!)  Same goes for Latin y and Greek gamma.

Well, you get the point.  The world of alphabetic confusables is just not 
that simple or that 1-to-1.  There are more edge cases, in fact, than obvious 
cases such as the a/alpha or o/omicron that we keep hearing about.  And if I 
were trying to design this hypothetical "Uniglyph" encoding to get rid of 
those pesky confusables, and still provide support for alphabetic scripts 
besides Latin, I would eventually have to face the fact that it *can't be 
done*.  Oh, sure, it can be done for a/alpha and o/omicron, so I can make a 
sales presentation or a picket sign.  But a complete technical solution, uh, 
no.

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




IPA keyboard

2002-02-09 Thread DougEwell2

I am looking for information on IPA keyboards.  I would like to build a 
keyboard for SC UniPad that would allow the user to type IPA characters 
directly.

What I want:
- Text.  (Could be plain text, PDF, Word, Excel, etc.)
- Keys referenced by ISO 9995, scan codes, or U.S. English assignment
- Characters referenced by Unicode values, or at least SGML entities
- Preferably no more than four (4) discrete keyboard states

Linux keymaps are fine if they meet the above requirements.

What I don't want:
- Graphic images without a corresponding text description
- Anything that requires me to install Keyman
- Anything related to "ASCII IPA"

Any such information would be appreciated.

Thank you,

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




Re: Unicode and Security: Domain Names

2002-02-08 Thread DougEwell2

In a message dated 2002-02-08 8:23:22 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> Does anyone know anything about RACE encoding and its properties?
>
> I wrote an article on IDNS in December of 2000 which discusses the
> approaches which were being debated at that time, including RACE. RACE
> is briefly described in that article. You can find it at:
>
> http://www-106.ibm.com/developerworks/library/u-domains.html
>
> I tried to find an updated internet draft on RACE, but looks like
> nothing exists after version 4, which has been archived. I'm guessing
> that draft names which include the text BRACE, TRACE, and GRACE are
> probably RACE variations however. Check them out at:
> http://www.ietf.org/internet-drafts/ 

An ACE (ASCII-Compatible Encoding) has been chosen for IDN, and it is neither 
RACE nor DUDE.  Its working name was AMC-ACE-Z, and it has since been renamed 
Punycode.  (No, I don't like the name either.)

A search for "punycode" in the internet-drafts directory that Suzanne 
mentioned will reveal the details you are looking for.

Beware that in addition to Punycode, there is another step in the IDN process 
called "nameprep," which is basically an extended form of normalization to 
keep compatibility characters, non-spacing marks, directional overrides, and 
such out of domain names.  Converting an arbitrary string through Punycode 
does not necessarily make it IDN-ready.
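As an illustration, Python's built-in `punycode` codec implements the same RFC bootstring algorithm (this is a latter-day convenience, not part of the 2002 toolchain); note that it applies neither the `xn--` ACE prefix nor nameprep, which matches the caveat above:

```python
# Round-trip a label through Punycode.  The basic (ASCII) code points
# are copied to the front; the delta-encoded suffix follows the hyphen.
label = 'b\u00FCcher'                      # "bücher"
ace = label.encode('punycode')
print(ace)                                 # b'bcher-kva'
assert ace.decode('punycode') == label     # losslessly reversible
```

The registrable IDN form would then be the ACE prefix plus this output, e.g. `xn--bcher-kva`, but only after the label has passed through nameprep.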

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




Re: [idn] RE: Unicode and Security

2002-02-07 Thread DougEwell2

[EMAIL PROTECTED] observed:

> Analogously, people will keep opening executable attachments promising sex,
> regardless of whether the 's', 'e', and 'x' are Latin letters or not.

They're not, of course:  U+0455 U+0435 U+0445
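Anyone who wants to verify that can ask for the character names; a quick Python sketch:

```python
import unicodedata

# The look-alike "sex" is actually three Cyrillic letters:
fake_sex = '\u0455\u0435\u0445'
for c in fake_sex:
    print(f'U+{ord(c):04X} {unicodedata.name(c)}')
# U+0455 CYRILLIC SMALL LETTER DZE
# U+0435 CYRILLIC SMALL LETTER IE
# U+0445 CYRILLIC SMALL LETTER HA
```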

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




Re: Key E00 (was: (no subject))

2002-02-06 Thread DougEwell2

In a message dated 2002-02-06 3:39:14 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> "ISO" keyboards have the section-sign (§) key, next to the 1 key 
> above the tab key on the left of the keyboards. Some US keyboards 
> (for instance the Mac PowerBook G3) don't have this key, but instead 
> have the grave key there, while on the "ISO" keyboard the grave key 
> is down next to the z.

My draft copy of ISO/IEC 9995-3, acquired from:

http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0233_9995-3.pdf

shows SECTION SIGN on key C02, level 2 of the common secondary group, and 
GRAVE ACCENT on key C12, level 1 on both the complementary Latin and common 
secondary groups.  (Note that C12 is frequently relocated to B00, down next 
to the 'z' as you indicated.)

In the complementary Latin group, key E00 is ASTERISK (level 1) and PLUS SIGN 
(level 2), while in the common secondary group it is NOT SIGN (level 1) and 
SOFT HYPHEN (level 2).

Which "ISO" keyboard are you referring to?  I'm not trying to be 
argumentative; I just got done implementing a lot of keyboards, and none of 
them had SECTION SIGN on key E00, so I'm curious.

For those unfamiliar with ISO 9995 terminology, please refer to the above 
document as well as:

http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0232_9995-2.pdf

and John Cowan's explanation from yesterday.

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




Re: Cherokee accent

2002-02-06 Thread DougEwell2

Here is the response I got from the Cherokee Nation, to whom I cc'd my 
original question about the Cherokee accent mark.

So, is this a candidate for encoding?

-8<-begin forwarded message-8<-

>  The accent is to be used on the syllable with the accent when pronouncing 
>  the word, just like an accent is used in the pronunciation key of an English 
>  word.
>  
>  Thank you for your inquiry,
>  LISA
>  wadulisi
>  
>  Name: Lisa Stopp > wadulisi dinalewisda
>  Resource Coordinator for the Arts
>  Cultural Resource Center
>  Cherokee Nation
>  PO Box 948, Tahlequah, OK 74465
>  918-458-6170
>  fax 918-458-6172
>  E-mail: Lisa Stopp <[EMAIL PROTECTED]>
>  Date: 02/07/2002
>  Time: 10:19:54

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)



Re: Key E00 (was: (no subject))

2002-02-06 Thread DougEwell2

In a message dated 2002-02-05 10:08:54 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> For those of us not in the know, please tell us what the heck "key E00, 
level
>> 1" means.
>
> It is the section-sign (§) key, next to the 1 key above the tab key 
> on the left. Some US keyboards don't have this key, but instead have 
> the grave key there.

I don't know of any U.S. keyboards that have U+00A7 in that location.

I thought maybe this was a characteristic of Irish or Gaelic keyboards, but 
Microsoft's keyboard site doesn't show it there either.  Is this a Macintosh 
convention?

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)

P.S.  sorry for the "no subject" header in my original message.




(no subject)

2002-02-05 Thread DougEwell2

On the official Web site of the Cherokee Nation (Tahlequah, Oklahoma), there 
is a nice Cherokee keyboard layout that goes with the font they offer:

http://www.cherokee.org/Extras/downloads/font/Keyboard.htm

For key E00, level 1 (i.e. the unshifted grave-accent key), there is a little 
squiggly mark called "Accent."  I can't find any indication of the purpose of 
this character -- what it's supposed to accent -- but it's not encoded in 
Unicode.

Does anyone know what this character is for, or why it wasn't encoded?  I 
read Michael Everson's 1995 proposal for Cherokee (WG2 N1172) and couldn't 
find any mention of it.

-Doug Ewell
 Fullerton, California
 (address will soon change to [EMAIL PROTECTED])




Re: "plain text" and plane 14 lang tags

2002-02-05 Thread DougEwell2

In a message dated 2002-02-04 9:07:22 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> In plain text, I think that plane 14 language tags could be used
>
> It seems to me that such usage confuses the meaning of "plain text". Use 
> of the plane 14 tagging characters to indicate language would be markup 
> -- metadata that is separate from the content and that has some impact on 
> how the content should be processed.

I'm afraid this is one place where Peter and I are forever destined to 
disagree.  While Plane 14 tags do perform a markup-like function -- just as 
the directional overrides and variation selectors do -- they are discrete 
Unicode characters, and so, by definition, they are plain text.  From TUS 
3.0, page 16:  "The Unicode Standard encodes plain text."

> It's just a coincidence that the 
> markup uses distinct characters from the content.

It's not a coincidence at all.  Plane 14 in general, and the specific code 
points in particular, were intentionally chosen to ensure that the tag 
characters would not conflict with any other characters.

In HTML, the string "" -- a sequence of ordinary ASCII 
characters -- has a special, higher-level meaning that is defined by the 
markup language.  In another context, that string might not have the same 
meaning; another string might convey that meaning, or there might not be any 
such markup available.

By contrast, the Unicode sequence U+E0001 U+E0078 U+E0068 has only one 
meaning, defined by the character encoding standard as clearly as it defines 
the letter A (if not more so).
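The tag characters are simply printable ASCII shifted up into Plane 14, introduced by U+E0001 LANGUAGE TAG, so the sequence above can be built mechanically. A small sketch (the helper name is mine):

```python
# Plane 14 tag characters mirror ASCII 0x20-0x7E at U+E0000 + code point.
# A language tag is U+E0001 followed by the shifted tag string.
def language_tag(code):
    return '\U000E0001' + ''.join(chr(0xE0000 + ord(c)) for c in code)

seq = language_tag('xh')
assert [ord(c) for c in seq] == [0xE0001, 0xE0078, 0xE0068]
```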

-Doug Ewell
 Fullerton, California
 (address will soon change to [EMAIL PROTECTED])




Re: Old Hungarian

2002-01-31 Thread DougEwell2

In a message dated 2002-01-31 20:20:33 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Does anyone know if anything happened since the last
> proposal in 1988 to include Old Hungarian

Actually 1998.  But yes, I was wondering about the status of the rovásírás as 
well.

-Doug Ewell
 Fullerton, California




Re: Beta version

2002-01-30 Thread DougEwell2

In a message dated 2002-01-30 21:48:47 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Is there someplace where we, the unwashed masses, have access to these
> documents?

Yeah.  Good question.  I've found some of them myself, in particular the code 
charts, by poking around the WG2 site at dkuug.dk and in other places.  If 
they're on the public Internet, I have every right to see them and download 
them, but they clearly weren't put there for that purpose.

-Doug Ewell
 Fullerton, California




Re: Unicode Search Engines

2002-01-30 Thread DougEwell2

In a message dated 2002-01-30 7:28:36 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> It is not a 'fatal flaw'.

I didn't say it was.  I meant to say, but wasn't clear enough in doing so, 
that on other mailing lists the tendency is to blame Unicode for any problem 
or inconvenience in character handling.  (You should know which one I mean, 
Mark; you're on it. :)

-Doug Ewell
 Fullerton, California




Re: Unicode Search Engines

2002-01-29 Thread DougEwell2

In a message dated 2002-01-28 7:37:48 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> I would like to add:
> How do they handle normalization?
> In Vietnam, many characters can be represented in several different ways:
> (1) fully precomposed (NFC)
> (2) base character and modifier precomposed, tonal mark combining
> (3) base character, then modifier, then tonal mark
> (4) like (3), but modifier and tonal mark sorted (NFD)
> Do the search engines do any normalization, before indexing a page?
> Are queries normalized before running the search?

I'm not sure what sort of normalization might be performed by search engines, 
but I want to examine the Vietnamese decomposition aspect for a moment.

If you have a Vietnamese vowel with both modifier and tone mark, say LATIN 
CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in 
Unicode in at least three ways:

(1) fully precomposed (NFC) -- that is, U+1EA4
(2) base character and modifier precomposed, tonal mark combining -- that is, 
U+00C2 U+0301
(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302 
U+0301

So far, so good.  But then we have:

> (4) like (3), but modifier and tonal mark sorted (NFD)

If "sorting" the diacritical marks in NFD results in rearranging the two 
diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of 
Vietnamese orthography, the NFD form may not really be a legitimate way of 
representing the Vietnamese letter.

For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is, 
in Vietnamese, a circumflexed A to which a tone mark (dot below) has been 
added.  It is not a dotted-below A to which a circumflex has been added.  Yet 
because of the canonical combining classes of the two diacriticals (230 for 
COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how 
the character will be decomposed.

In theory, there is actually a case 5: base character and tonal mark 
precomposed, modifier combining.  In terms of Vietnamese orthography, this is 
just as illegitimate as case 4 (NFD), but most software that processes 
Vietnamese text will probably never encounter it.  But it will have to handle 
the NFD case.
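The combining-class arithmetic described above is easy to confirm with Python's unicodedata module (a latter-day illustration, not something available to 2002 software):

```python
import unicodedata

# U+1EA4: circumflex (ccc 230) and acute (ccc 230) share a combining
# class, so NFD preserves the typing order -- this is form (3) above:
assert unicodedata.normalize('NFD', '\u1EA4') == 'A\u0302\u0301'

# U+1EAC: dot below has ccc 220 < 230, so canonical ordering places it
# *before* the circumflex -- the reordering the text describes:
assert unicodedata.normalize('NFD', '\u1EAC') == 'A\u0323\u0302'

# NFC recomposes the reordered sequence back to the precomposed letter:
assert unicodedata.normalize('NFC', 'A\u0323\u0302') == '\u1EAC'
```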

If I were on some other mailing lists I could think of, I would claim that 
this is a fatal flaw in the design of Unicode Normalization Form D.  It's 
not, but it is a sticky problem that needs to be dealt with when dealing with 
Vietnamese text.

-Doug Ewell
 Fullerton, California




Re: Variation Selection

2002-01-27 Thread DougEwell2

In a message dated 2002-01-27 18:51:35 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> First, have we all servers?

No.  Assuming we all do is no better than assuming we all have broadband or 
T1 connections.

> Second, if it is a small attachment, the number of bytes transmitted to GO 
> there may well be greater than the number to send the bloody thing. How 
> about an arbitrary attachment limit of 1535 bytes?

I just sent a 500-byte attachment to the Unicode list without any qualms.  
Almost all messages are longer than 500 bytes, especially when you take 
headers into account.  However, there is certainly such a thing as "too 
long," and I have no idea where to draw the line and say "this attachment is 
short enough, that one is too long."

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-26 Thread DougEwell2

In a message dated 2002-01-26 19:58:28 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> Double-struck pi!  What better symbol to represent 2 * pi?
>
> These double struck symbols are used by mathematical sofware
> precisely because they are NOT yet used for regular operators
> or variables. Please don't make such recommendations before
> understanding the nature of the symbol you are suggesting
> to abuse!

Sorry.  I probably should have guessed that they were being added to 3.2, to 
the BMP no less, for a specific reason.

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-26 Thread DougEwell2

One of the new characters scheduled for Unicode 3.2 is

U+213F DOUBLE-STRUCK CAPITAL PI

(A 500-byte GIF is attached.)

Double-struck pi!  What better symbol to represent 2 * pi?

-Doug Ewell
 Fullerton, California




Re: POSITIVELY MUST READ! Bytext is here!

2002-01-26 Thread DougEwell2

In a message dated 2002-01-26 5:46:37 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> I'm surprised you all went for it.

Well, I was curious, sort of like a passing motorist who slows down to see 
the aftermath of a collision.

The same thing happened when someone announced that TRON was going to destroy 
Unicode and take over the world.  I had to take a look, at least.

One of my favorite parts of Bytext was the section on Emoticons.  Certainly, 
one thing that a "serious competitor" to Unicode must have is "a rich set of 
emoticons as single characters."  I've always felt the UTC was badly out of 
touch with the user community by neglecting to encode TIRED-BORED FACE, QUESY 
[sic] FACE, YUKKY FACE, and DROOLING FACE.

-Doug Ewell
 Fullerton, California




Re: POSITIVELY MUST READ! Bytext is here!

2002-01-25 Thread DougEwell2

In a message dated 2002-01-25 20:45:46 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Unicode now has a serious competitor. Please read
> about it at www.bytext.org. Everyone on this list
> should find it extremely interesting. 

I just downloaded the PDF file and spent about 10 minutes skimming through 
it.  This is a joke, right?

-Doug Ewell
 Fullerton, California




Re: TC/SC mapping

2002-01-24 Thread DougEwell2

Many have responded:

> Meanwhile, it is true that there are simplified characters which 
> correspond to more than one traditional form.
...
> This is the kind of mess that has discouraged anybody from doing a 
> systematic survey of simplifications for the Unihan database.
...
> Before converting TC to SC, one should resolve all TC variants to
> the most "common" or "standard" TC form (good luck deciding what that
> means).
...
> I think that any mapping will fail.

Thanks to everyone for your input concerning the TC/SC mapping issue.  You 
have confirmed what I already knew, but needed concrete evidence of; namely, 
that mapping between Traditional Chinese and Simplified Chinese is not a 
simple 1-to-1 table lookup problem, but involves lexical analysis and even 
knowledge of the author's intent.

Currently on the IDN mailing list there is a big debate over this topic.  It 
is well known that ASCII-based domain names are matched in the DNS in a 
case-insensitive manner.  Many people recognize that Chinese readers who are 
familiar with both TC and SC consider text written in the two sub-scripts to 
be interchangeable, in roughly the same way that uppercase and lowercase 
Latin are interchangeable.  They would like Chinese domain names written in 
TC to match the "equivalent" name written in SC, just as "UNICODE.ORG" 
matches "unicode.org".

The problem is getting people to understand the scope of the problem.  As you 
have illustrated so well, TC/SC mapping is NOT, in the general case, as 
simple as Latin case mapping.  It requires content analysis, and possibly 
some form of tagging.

Almost all of the list members whose e-mail addresses end in .cn, .tw or .hk 
seem to believe that there is a willful disregard on the part of the working 
group for the needs of Chinese users in this respect.  We have tried to 
convince them that (a) the solution is not as simple as Latin case mapping, 
as many have portrayed it; (b) the problem is not with Unicode Han 
unification, since TC and SC are not unified; (c) content analysis is not 
feasible for domain names; and (d) the entire problem is out of scope of the 
IDN WG.  We have proposed that organizations register both .cn 
and .cn if they want both hits to be successful.  So far, not 
much convincing has taken place.  In the above case, they claim that all 
eight (2^3) possible combinations (e.g. ".cn") would need to be 
registered, which is overkill.
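The combinatorial point can be sketched with per-position variants. The three simplified/traditional pairs below are real, but the three-character "name" is invented purely for illustration:

```python
from itertools import product

# Hypothetical 3-character name where each position has a TC/SC pair:
variants = [('国', '國'), ('门', '門'), ('东', '東')]

# Every mixed form a registrant would have to cover:
forms = [''.join(chars) for chars in product(*variants)]
assert len(forms) == 2 ** 3   # eight registrations for one name
```

The count doubles with each additional character that has a variant, which is why per-form registration scales so badly.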

One list member has even proposed the prohibition of all CJK code points from 
internationalized domain names "until the problem can be solved," and he has 
the support of several others.  It is obvious that this is an attempt to 
hijack the entire IDN model by claiming "it does not support Chinese at all," 
which would certainly be true if Han characters were prohibited, and imposing 
a locally-constructed, Chinese-specific (i.e. not universal) model later on.

Unfortunately, as an American who does not speak or read Chinese, I have been 
in a poor position to argue with these people about their own written 
language.  So I relied on the combined expertise of the Unicode list, 
including native speakers and people with doctorates in Chinese, for 
background information.  Thanks again for your help.

-Doug Ewell
 Fullerton, California




Re: [Very-OT] Re: ü

2002-01-23 Thread DougEwell2

In a message dated 2002-01-23 13:32:39 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> Webster's New World Dictionary of the American Language
>> lists only (fil' här män' ik).
>
> BTW, are those two a's really identical?

They are in my dialect, a mixture of Southern California and Great Lakes, but 
not in some others.  For example, they would be different in British RP.

By the way... (desperate attempt to get this thread back on-topic)

- the first is U+10402 (or U+1042A)
- the second is U+10409 (or U+10431)

The biggest problem I had learning the Deseret Alphabet was figuring out the 
difference between these two vowels, especially since they're the same to me.  
Now 
I decide on the basis of how I think the vowels would be pronounced in RP, so 
"philharmonic" is spelled:

10441 1042E 1044A 10438 1042A 10449 1044B 10431 1044C 1042E 1043F

(Yes, I pronounce both the h and the r.)
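For the curious, the spelling above can be rebuilt from its code points; the counts also show that Deseret, being in Plane 1, needs surrogate pairs in UTF-16 (a quick Python sketch):

```python
# The code points for "philharmonic" in Deseret, as listed above:
points = [0x10441, 0x1042E, 0x1044A, 0x10438, 0x1042A, 0x10449,
          0x1044B, 0x10431, 0x1044C, 0x1042E, 0x1043F]
word = ''.join(chr(cp) for cp in points)
assert len(word) == 11                      # 11 code points...
assert len(word.encode('utf-16-le')) == 44  # ...but 22 UTF-16 code units
```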

-Doug Ewell
 Fullerton, California




U+2623 BIOHAZARD SIGN

2002-01-21 Thread DougEwell2

The glyph for U+2623 should be revised.  The international biohazard symbol 
is well-defined and subject to little variation.  In particular, the "lobster 
claws" pointing to 12:00, 4:00 and 8:00 should be perfectly round, or almost 
so.  The glyph published in the Unicode Standard looks more like a bird's-eye 
view of three dolphins feeding.

For a good rendition of U+2623, see:

http://www.subvertise.org/details.php?code=213

minus the enclosing triangle.

-Doug Ewell
 Fullerton, California




Re: Unicode 3.2 Beta Period Finishing

2002-01-21 Thread DougEwell2

In a message dated 2002-01-21 19:45:47 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> We are fast approaching the end of the beta period for Unicode 3.2,
> completing at the end of this week. The following data files have been
> changed in January:
...
>  StandardizedVariants..>  21-Jan-2002 18:05  37k

Many of the embedded images in the Standardized Variants document are missing.

Will the emergence of standardized variants, using variation selectors, 
prevent such abominations as the Korean fractions from being explicitly 
encoded?  (IIRC, there was a proposal from one of the Koreas to encode 
duplicates of the vulgar fractions with a horizontal bar instead of a diagonal one.)

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-21 Thread DougEwell2

In a message dated 2002-01-21 17:32:49 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Capital pi is to product as capital sigma is to summation.

Well, then it certainly can't be used for 6.283185... as well.  The search 
continues.

-Doug Ewell
 Fullerton, California




SCSU (was: Re: Devanagari)

2002-01-21 Thread DougEwell2

In a message dated 2002-01-21 5:20:55 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Doug Ewell wrote:
>> Devanagari text encoded in SCSU occupies exactly 1 byte per
>> character, plus an additional byte near the start of the
>> file to set the current window (0x14 = SC4).
>
> The problem is what happens if that very byte gets corrupted for any
> reason...
>
> If an octet is erroneously deleted from, changed in, or added to a UTF-8 stream,
> only a single character would be corrupted. If the same thing happens to the
> window-setting byte of a SCSU (or other similar "zany" formats), the whole
> stream turns into garbage.

Yes, SCSU is stateful and the corruption of a single tag, or argument to a 
tag, could potentially damage large amounts of text.  I know this was a big 
problem in the days of devices and transmission protocols that did little or 
no error correction.  I honestly don't know how big a problem it is today.

> What this means in practice for website developers is:
>
> 1) SCSU text can only be edited with a text editor which properly decodes
> the *whole* file on load and re-encodes it on save. On the other hand, UTF-8
> text can also be edited using an encoding-unaware editor, although non-ASCII
> text is invisible.

I have edited SCSU text using a completely encoding-ignorant MS-DOS editor.  
Of course I couldn't edit the SCSU control bytes intelligently, but then I 
can't edit multibyte UTF-8 sequences intelligently with it either.

> 2) SCSU text cannot be built by assembling binary pieces coming from
> external sources. E.g., you cannot get a SCSU-encoded template file and fill
> in the blanks with customer data coming from a SCSU-encoded database: each
> time you insert a piece of text coming from the database, you delete the
> current window information, turning into garbage the rest of the file.

The current window information is not deleted, it is carried over into any 
adjoining text that does not redefine it.  (This could have its own 
repercussions, of course.)

> 3) A SCSU page can only be accepted by browsers and e-mail readers that are
> able to decode it. On the other hand, UTF-8 also works on old ASCII-based
> browsers, although non-ASCII text is clearly not properly displayed.

Same as 1).  If you have only ASCII text, SCSU == UTF-8 == ASCII, and if you 
have non-ASCII text, both SCSU and UTF-8 encode that text with byte sequences 
that readers must know how to decode.  SCSU does use states, like any 
compression scheme, so an encoding-ignorant tool will probably have more 
trouble with SCSU than with UTF-8.  But I was not arguing to foist SCSU on an 
unprepared world, I was suggesting that the world should prepare.  \u263a

-Doug Ewell
 Fullerton, California




SCSU (was: Re: Devanagari)

2002-01-21 Thread DougEwell2

In a message dated 2002-01-21 1:33:23 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Do you know of any published web pages that use SCSU? I think that's
> probably the place to start. I never add support for encodings I can't
> find in actual use on the web. (Hint hint. :)

This becomes a vicious circle, as it is just as reasonable to say that I 
never create Web pages in encodings that existing browsers can't support.

I'm not sure what is the best way to break this circle, except that when I do 
finally set up a Web site (\u263a) I might include a parallel SCSU version 
along with the UTF-8 version, along with a brief description of SCSU.

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 21:49:02 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> The issue was originally brought up to gather opinion from members of this
> list as to whether UTF-8 or ISCII should be used for creating Devanagari web
> pages. The point is not to criticise Unicode but to gather opinions of
> informed persons (list members) and determine what is the best encoding for 
> information interchange in South-Asian scripts...

It seems that the only point against Unicode compared to ISCII is the 
resulting document size in bytes, and this one point is being given 100% 
focus in the comparison.

If the actual question is, "What is the most efficient encoding for 
Devanagari text, in terms of bytes, using only the most commonly encountered 
encoding schemes and no external compression?" then of course you will have 
loaded the question in favor of ISCII.

But when you consider that more browsers today around the world (not just in 
India) are equipped to handle Unicode than ISCII, and that Unicode allows not 
only the encoding of ASCII and Devanagari but the full complement of Indic 
scripts (Oriya, Gujarati, Tamil...) as well as any other script on the planet 
that you could realistically want to encode, you will probably have to 
rethink the cost/benefit tradeoff of Unicode.

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 20:49:00 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Usually, when someone offers
> a large body of plain text in any script, files are compressed 
> in one way or another in order to speed up downloads.

This is why I really wish that SCSU were considered a truly "standard" 
encoding scheme.  Even among the Unicode cognoscenti it is usually 
accompanied by disclaimers about "private agreement only" and "not suitable 
for use on the Internet," where the former claim is only true because of the 
self-perpetuating obscurity of SCSU and the latter seems completely 
unjustified.

Devanagari text encoded in SCSU occupies exactly 1 byte per character, plus 
an additional byte near the start of the file to set the current window (0x14 
= SC4).
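The one-byte-per-character claim is easy to demonstrate.  The following is a 
minimal sketch of my own (not a conforming encoder -- no window switching, no 
Unicode mode) of SCSU's single-byte window mechanism, handling only ASCII plus 
the Devanagari block:

```python
# Minimal sketch of SCSU's single-byte window mode for Devanagari text.
# Not a full SCSU encoder: it assumes the input contains only ASCII and
# Devanagari (U+0900..U+097F) and never needs another window.

SC4 = 0x14               # tag byte: select predefined dynamic window 4
WINDOW4_OFFSET = 0x0900  # window 4 covers the Devanagari block by default

def scsu_encode_devanagari(text: str) -> bytes:
    out = bytearray([SC4])   # one byte up front to select the window
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)   # ASCII passes through unchanged
        elif 0x0900 <= cp <= 0x097F:
            # in-window characters take exactly one byte: 0x80 + offset
            out.append(0x80 + cp - WINDOW4_OFFSET)
        else:
            raise ValueError(f"U+{cp:04X} is outside this sketch's windows")
    return bytes(out)
```

For example, encoding U+0905 DEVANAGARI LETTER A yields the two bytes 0x14 
0x85: one setup byte for the whole stream, then one byte per character.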

-Doug Ewell
 Fullerton, California




Re: Devanagari

2002-01-20 Thread DougEwell2

In a message dated 2002-01-20 16:49:17 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> The point was that a UTF-8 encoded HTML file for an English web page
> carrying say 10 gifs would have a file size one-third that for a Devanagari
> web page with the same no. of gifs...
> Therefore transmission of a Devanagari web page over a network would take
> thrice as long as that of an English web page using the same images and
> presenting the same information.

This conclusion ignores two obvious points, which Asmus already made:

(1) The 10 GIFs, each of which may well be larger than the HTML file, take 
the same amount of space regardless of the encoding of the HTML file.  The 
total number of bytes involved in transmitting a Web page includes 
everything, HTML and graphics, but the purported "factor of 3" applies only 
to the HTML.

(2) The markup in an HTML file, which comprises a significant portion of the 
file, is all ASCII.  So the "factor of 3" doesn't even apply to the entire 
HTML file, only the plain-text content portion.

In addition, text written in Devanagari includes plenty of instances of 
U+0020 SPACE, plus CR and/or LF, each of which occupies one byte regardless 
of the encoding.
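A rough check with a made-up snippet (the markup and text here are invented 
purely for illustration) shows how far below 3 the real ratio tends to fall 
once markup, spaces, and line breaks are counted:

```python
# Back-of-the-envelope check of the "factor of 3" claim: markup, spaces,
# and line breaks stay 1 byte in UTF-8; only the Devanagari letters
# themselves take 3 bytes each.
sample = ('<p title="greeting">\u0928\u092e\u0938\u094d\u0924\u0947 '
          '\u0926\u0941\u0928\u093f\u092f\u093e</p>\n')

chars = len(sample)
utf8_bytes = len(sample.encode("utf-8"))
print(chars, utf8_bytes, utf8_bytes / chars)  # ratio well under 3
```

On this snippet the bytes-per-character ratio comes out below 2, because more 
than half the characters are ASCII markup and whitespace.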

I think before worrying about the performance and storage effect on Web pages 
due to UTF-8, it might help to do some profiling and see what the actual 
impact is.

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-19 Thread DougEwell2

In a message dated 2002-01-19 17:07:34 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> In fact Cajori mentions that
> the capital pi Was used at some point for 6.28... so someone had
> the same idea long before I did.

That is a VERY intriguing thought, one that should be especially worthy of 
mention to the AMS.  I thought capital pi already had an established meaning, 
but perhaps that is in physics or some other branch of science rather than 
mathematics.

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-19 Thread DougEwell2

In a message dated 2002-01-19 11:35:57 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> Could one of these characters,
>> already approved and part of Unicode, be adopted to represent 2pi?
>
> That's up to the AMS, not to us.

Indeed.  It might be a good topic for the AMS discussion forum that Robert 
mentioned.

-Doug Ewell
 Fullerton, California




Copyleft (was: Re: The benefit of a symbol for 2 pi)

2002-01-19 Thread DougEwell2

In a private message dated 2002-01-18 1:03:05 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> And what is wrong with copyleft?

The symbol wasn't in general use in plain text, only as a logo (e.g. on 
T-shirts and bumper stickers) promoting the Free Software cause.

I should not have used the copyleft symbol as an example of the UTC and/or 
WG2 not wanting to make a social or political statement.  I had no basis for 
making that claim.  However, the lack of usage in plain text was an 
appropriate reason for not encoding it.

Like newpi, the copyleft symbol would probably be worth a second look if its 
proponents could gather enough, and diverse enough, examples of its usage.

(I apologize to juuitchan for quoting and responding to his private message 
on the public list, but felt the response might be of general interest.)

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-19 Thread DougEwell2

In a message dated 2002-01-19 9:33:46 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> Has there been any consideration of practical alternatives, such as
>>selecting a lookalike or similar character from the plethora of those
>>already encoded and promoting its use to represent the "newpi"
>> character?
>
> My own proposal was a pictogram: A circle with a radius to "3 o'clock",
> i.e. from 0 to 1 in the complex number plane. Pacman with mouth closed.
> Does that already exist in Unicode? :-) My dad's version is a lot more
> palatable for most people. 

A large number of glyphic variations of Latin and Greek letters were just 
added to Unicode 3.1 with the sole purpose of serving as mathematical 
identifiers.  Apparently it was stressed by the AMS and others that these 
variations (bold, italic, double-struck, sans serif, etc.) are all 
significant and distinct in math notation.  Could one of these characters, 
already approved and part of Unicode, be adopted to represent 2pi?

In particular, consider the following already-encoded variants of pi:

U+1D6B7 MATHEMATICAL BOLD CAPITAL PI
U+1D6D1 MATHEMATICAL BOLD SMALL PI
U+1D6E1 MATHEMATICAL BOLD PI SYMBOL
U+1D6F1 MATHEMATICAL ITALIC CAPITAL PI
U+1D70B MATHEMATICAL ITALIC SMALL PI
U+1D71B MATHEMATICAL ITALIC PI SYMBOL
U+1D72B MATHEMATICAL BOLD ITALIC CAPITAL PI
U+1D745 MATHEMATICAL BOLD ITALIC SMALL PI
U+1D755 MATHEMATICAL BOLD ITALIC PI SYMBOL
U+1D765 MATHEMATICAL SANS-SERIF BOLD CAPITAL PI
U+1D77F MATHEMATICAL SANS-SERIF BOLD SMALL PI
U+1D78F MATHEMATICAL SANS-SERIF BOLD PI SYMBOL
U+1D79F MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL PI
U+1D7B9 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PI
U+1D7C9 MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL

Now, it may well be that a typographical variant of pi is not the best choice 
to represent (2 * pi).  That's OK, there are still hundreds more of these 
math-specific characters to choose from.

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-18 Thread DougEwell2

Robert Palais wrote:

> I will be doing so, and apologize if my inquiry intruded on your
> work, and at the same time, appreciate the many thoughtful
> considerations on the matter of process of symbol standardization
> that I received.

and later:

> I apologize again if my misunderstanding that I was advised to
> bring it up directly here offended you...

Please be assured that nobody here feels offended, intruded upon, or 
otherwise discommoded as a result of any of your inquiries.  You have every 
right, and in fact you are strongly encouraged, to discuss your proposed new 
character and ask questions about adding it to Unicode.

For my part at least, I feel it is important to explain to proponents WHY 
their proposed characters may not be suitable for encoding, rather than 
simply telling them No.

> Observing your discussions, I do wonder if the participants
> recognize the responsibility of their influence upon ideas,
> through symbols

I think the Unicode Consortium and WG2 do understand this, and that is why 
they are so reluctant to encode symbols that do not have established usage, 
as in the case of 2 pi, or seek to make a social or political statement that 
the Consortium and WG2 do not intend, as in the case of copyleft.

> (but it seems some may enjoy it too much.)

I am totally mystified by this remark.

-Doug Ewell
 Fullerton, California




Re: The benefit of a symbol for 2 pi

2002-01-17 Thread DougEwell2

This discussion has sparked a few lively contributions and brought up some 
important points, so even though it may have been beaten to death and Robert 
has announced his intent to move it to another forum, I still have some 
comments that may be pertinent in the AMS discussion.

I was a little disappointed that others were drawn into the debate over the 
relative merits of the constants 3.14 versus 6.28, since this issue is 
completely irrelevant to the question of encoding a new "2 pi" symbol in 
Unicode.  For most of us, at least, the objection to encoding this symbol in 
Unicode has nothing at all to do with its theoretical usefulness, but its 
lack of currency at the present time.

Whether the symbol would be useful or represents an important mathematical 
constant is not the point.  It must be commonly used, or at least recognized, 
within the field.  It is not a question of whether typographers would 
personally see the benefit of such a symbol (BTW, not all of us are 
typographers; this list also includes software developers, linguists, and 
standardization types).  The symbol must already be in use, as determined by 
a sufficient body of work.  Mathematicians are researchers; they know it is 
not sufficient to cite a single article, especially one written by oneself or 
one's associates, as a "body of work" to demonstrate the use or non-use of 
something.

The 2 pi symbol is an experiment, and it is important to remember that not 
all experiments are successful!  Some proposed characters, words, ideas, TV 
shows, etc. do not achieve a sufficient level of popularity and are 
discarded.  As Rick McGowan indicated, it is not a goal of Unicode to encode 
characters that someone, even a lot of people, believe *might* (or should) 
achieve widespread use; they must already be in use (with one notable 
exception; see below).

Think of a dictionary.  New words are invented all the time, sometimes 
intentionally by companies or advertisers, yet nobody would think of going to 
Merriam-Webster and asking them to include their newly invented word in the 
next edition of the dictionary so that it will be recognized and will gain 
greater use.  Rather, the word has to gain a certain degree of acceptance 
*before* it is enshrined in the dictionary.  The same is true for encoding 
characters in Unicode.  If Dr. Beebe suggested trying to get the 2 pi 
character into Unicode to stimulate its adoption, then he does not understand 
the principles and policies of Unicode.

I wonder if there is a perception, because of the extensive work done by the 
Unicode Consortium and ISO/IEC JTC1/SC2/WG2 in encoding over 95,000 
characters, that any newly invented character or symbol can be encoded just 
for the asking.  The fact is that there are over 95,000 characters in 
Unicode, not because the relevant committees are fast and loose in encoding 
newly invented characters, but because there really are that many 
well-attested characters in the world.  (Well, OK, minus some of the 
compatibility characters.)

David Starner mentioned the proposed "copyleft" sign.  This was a reversed 
copyright sign (roughly representable by U+0254 plus U+20DD) which enjoys 
some use by the Free Software Foundation and the GNU project to signify a 
legal agreement that is similar to, but different from, a traditional 
copyright.  The symbol is apparently recognized by many adherents to the 
"free software" movement, probably more people than would recognize the 2 pi 
symbol.  But it turned out to be used primarily as a logo to promote the 
movement, rather than as a plain-text character to indicate the legal status 
of a work, as U+00A9 COPYRIGHT SIGN would be.  In fact, it would almost 
certainly not be recognized at all by non-FSF, non-GNU people except as a 
whimsical play on the copyright sign (the suggested name demonstrated the 
"anti-copyright" religion of the proponents).

The classic exception to the principle that a symbol must be in current use 
before it can be encoded is U+20AC EURO SIGN.  This character was encoded in 
Unicode 2.1 in 1998, long before most people -- including those who are now 
converting their money to euros -- had ever seen it.  But there was a 
difference: the Euro sign was invented by, and had the full support of, the 
European Monetary Union, and was *guaranteed* to become a commonly used 
symbol, something that cannot normally be said of most newly invented 
symbols.  Even though the symbol had never been used before, there was no 
question that it would "catch on."

A good measure of whether the 2 pi symbol has become sufficiently well 
recognized to be added to Unicode is whether it can be used in works like the 
JOMA article without having to explain or justify its usage beyond that which 
would be needed for any other symbol.

I hope that this discussion has shed some light on an important principle of 
Unicode for Robert (and perhaps for others), so that the AMS discussion can 
proceed in a productive manner.

-Doug Ewell
 Fullerton, California


Re: GBK Traditional to Simplified mapping table

2002-01-10 Thread DougEwell2

In a message dated 2002-01-10 20:04:50 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> So I'm still looking for a GBK-to-GBK mapping table that maps 
> traditional forms of Hanzi to their simplified equivalents.

The issue of mapping between traditional and simplified Chinese characters 
was debated at great length recently on the Internationalized Domain Names 
(IDN) mailing list.  It seems like a great idea, but it is simply not 
practical.  A great many TC characters do not have a 1-to-1 mapping to SC, 
and vice versa.  Far from being a simple operation like Latin case mapping 
(to which it was compared), TC/SC requires potentially complex analysis of 
the text being converted.
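A toy illustration (the example characters are mine, not drawn from the IDN 
discussion) of why a character-level table cannot work:

```python
# Why a pure character table fails for SC -> TC: one simplified form can
# stand for several distinct traditional characters, and only context
# (i.e., the word) picks the right one.  Pairs chosen for illustration;
# a real converter needs dictionary/word-level analysis.
sc_to_tc = {
    "\u53d1": ["\u767c", "\u9aee"],  # 发 -> 發 (to emit) or 髮 (hair)
    "\u540e": ["\u5f8c", "\u540e"],  # 后 -> 後 (behind) or 后 (queen)
}

def naive_convert(text: str) -> str:
    # A naive converter just grabs the first candidate -- sometimes wrong.
    return "".join(sc_to_tc.get(ch, [ch])[0] for ch in text)

# In the word for "hair (on the head)", 发 should become 髮, but the
# naive table blindly picks 發 -- the word-level context is exactly
# what a per-character table cannot see.
```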

This is the opinion of many experts within, as well as outside, the Unicode 
standardization effort, and it is the reason you will not find a Unicode 
TC/SC mapping table.

-Doug Ewell
 Fullerton, California




Re: Tengwar added to Plane1 Unicode Demo Page

2002-01-07 Thread DougEwell2

In a message dated 2002-01-07 10:31:32 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Making an entry in this demo file for a proposed Plane 1 script that is  
> NOT IN UNICODE, is both premature and dangerous. It has not been discussed  
> in committee, and a spot for Tengwar on the roadmap is absolutely NO  
> GUARANTEE of any future disposition.

There's another problem.  Even if Tengwar's slot in the U+12000 block of the 
Roadmap somehow assured us that it would one day be encoded there, there 
would still be no guarantee that its exact layout would be the same as 
proposed.

Deseret was laid out one way for its 1997 CSUR proposal, then a different way 
by Michael Everson (ISO/IEC JTC1/SC2/WG2 N1891) a year later, and finally a 
third way when it was finally encoded in Unicode 3.1.  And there were fewer 
questions concerning the encoding of Deseret than that of Tengwar.

> If someone feels compelled to make an entry for Tengwar in any demo,  
> please do it in the PUA so that people don't start getting the idea that  
> Tengwar is encoded, because it's not encoded, and is not going to be  
> encoded any time soon.

The only proper way at present to represent Tengwar in Unicode is according 
to the CSUR proposal, in the U+E000 block.  The use of undesignated code 
space in Unicode is a Bad Thing.  This should not be a question of whether 
one "dislikes" or "rejects" CSUR; it is, quite simply, the *least 
non-standard* way to do it.

-Doug Ewell
 Fullerton, California




[OT] Re: about mail list

2002-01-07 Thread DougEwell2

In a message dated 2002-01-07 19:54:16 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> i am adding a mail addresslist to mozilla in address books page the same as
> the tree directory of "personal address book and collected addresses "

... blah blah blah ...

> who can tell me how can the namespace
> label="rdf:http://home.netscape.com/NC-rdf#DirName";
> and IsMailList="rdf:http://home.netscape.com/NC-rdf
> #IsMailList" contact with the resource 
> and where the function is to  deal with the button.

Who can tell *me* how this query ended up on the Unicode list?

-Doug Ewell
 Fullerton, California




Re: Vertical scripts (was: Tategaki (was: Re: Updated...))

2002-01-05 Thread DougEwell2

I hasten to add:

> UTF-8 and UTF-32, at least, already have the architecture 
> to represent 2^31 and 2^32 code points, respectively.  The definitions 
> would simply have to be changed to make the additional code points legal.
>
> Only UTF-16 would truly need to be redesigned, and that has already been 
> proposed.

None of this is actually going to happen, of course.  Unicode and 10646 are 
committed to staying with 17 planes.  I was just pointing out that certain 
individuals had made informal proposals to extend the code space.

-Doug Ewell
 Fullerton, California




Re: Vertical scripts (was: Tategaki (was: Re: Updated...))

2002-01-05 Thread DougEwell2

In a message dated 2002-01-02 5:05:23 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> There are worse things than this: what if someone discovers a script with
> more than 1,114,111 characters? Back to the drawing board to redesign all
> the UTF's!

Not all of them.  UTF-8 and UTF-32, at least, already have the architecture 
to represent 2^31 and 2^32 code points, respectively.  The definitions would 
simply have to be changed to make the additional code points legal.
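As a sketch of what "already have the architecture" means for UTF-8, here is 
the original byte layout (as defined before the later restriction to 
U+10FFFF), which reaches 2^31 code points via 5- and 6-byte sequences.  The 
function is mine and purely illustrative:

```python
# Sketch of UTF-8's original byte layout (pre-restriction definition),
# which already reaches 2^31 code points via 5- and 6-byte sequences;
# later definitions merely forbid values above U+10FFFF.

def utf8_encode_31bit(cp: int) -> bytes:
    if cp < 0x80:
        return bytes([cp])
    # (upper limit, lead-byte marker, total sequence length)
    for limit, lead, n in [(0x800, 0xC0, 2), (0x10000, 0xE0, 3),
                           (0x200000, 0xF0, 4), (0x4000000, 0xF8, 5),
                           (0x80000000, 0xFC, 6)]:
        if cp < limit:
            out = bytearray()
            for _ in range(n - 1):        # trailing bytes, low bits first
                out.insert(0, 0x80 | (cp & 0x3F))
                cp >>= 6
            out.insert(0, lead | cp)      # lead byte carries the rest
            return bytes(out)
    raise ValueError("beyond 2^31 - 1")

# For code points up to U+10FFFF this agrees with standard UTF-8;
# the maximum value 0x7FFFFFFF encodes as FD BF BF BF BF BF.
```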

Only UTF-16 would truly need to be redesigned, and that has already been 
proposed.  For example, Masahiko Maedera once proposed a "UTF-16x" in which 
code points in the U+EExxx block were designated as "super surrogates."  
Three of these "super surrogates," or six 16-bit words, would be combined to 
represent code points beyond plane 17.  (This was back in the days when some 
people felt that a great and crippling schism existed between Unicode and ISO 
10646 because the former disallowed such code points and the latter allowed 
them.)

-Doug Ewell
 Fullerton, California




Re: Vertical scripts (was: Tategaki (was: Re: Updated...))

2001-12-29 Thread DougEwell2

Tex Texin replied to Marco Cimarosti:

>> Right-to-left vs. left-to-right are attributes of arbitrary *spans* of 
text,
>> which can easily be mixed within the same paragraph.
>>
> On the other hand, horizontal vs. vertical are attributes that can only
>> be applied to a whole paragraph or section.
>
> Marco, is that true? I thought that sometimes numbers for example "123."
> might be written horizontally in the middle of a vertical run.

Marco responded:

> But that would be a limited case for horizontal text embedded in vertical text:
> I cannot imagine a real-world situation for a vertical text embedded in
> horizontal text.

And Sampo Syreeni weighed in:

> I think this is something better handled by special-casing in rendering
> software -- the numbers (and whatnot) could be rendered as rotated or
> straight top-to-bottom as well. Considering this, it seems like a
> stylistic variation better controlled by an upper level protocol, if at
> all.

Tex's example may or may not be realistic -- I have no way of knowing -- but 
in suggesting a top-to-bottom directional override, I had hoped it would be 
possible to represent a run of text such as Tex describes without resorting 
to the infamous "higher protocol."

TUS 3.0 states (p. 24): "In contrast to the bidirectional case, the choice to 
lay out text either vertically or horizontally is treated as a formatting 
style.  Therefore, the Unicode Standard does not provide directionality 
controls to specify that choice."  This may seem arbitrary to some; why 
should overrides of default horizontal directionality be a plain-text issue 
but overrides of default vertical directionality be a higher-level 
"formatting style" issue?  I hope this discussion can shed some light on this 
question, and possibly help me see what I may be missing.

Actually, there is a more serious problem involved with vertical directional 
overrides: They would force the Unicode plain-text mechanism to become aware 
of both vertical directionality and directional priority.  This sounds 
obvious, but in fact there are not two, but THREE issues involved with text 
directionality:

1.  Horizontal, that is, left-to-right (LTR) versus right-to-left (RTL).
2.  Vertical, that is, top-to-bottom (TTB) versus bottom-to-top (BTT).
3.  Priority of direction (e.g. (LTR, TTB) versus (TTB, LTR)).

If you think about it, all text of non-trivial length has both horizontal and 
vertical directionality, and also a priority to the directionality.  
Horizontal and vertical directionalities are not opposites, they are 
complements.  The Latin script is written (LTR, TTB) which means not only 
that there is a horizontal directionality of left-to-right and a vertical 
directionality of top-to-bottom, but also that the horizontal directionality 
takes precedence over the vertical.  That is, we complete a horizontal (LTR) 
line before moving down the page (TTB) to start another line.

According to TUS 3.0,
Latin and most other European scripts are (LTR, TTB).
Arabic and most other Middle Eastern scripts are (RTL, TTB).
Ogham is either (LTR, TTB) or (BTT, ???).
Han is traditionally written (TTB, RTL) and more recently (LTR, TTB).
Mongolian is written (TTB, LTR).
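The table above can be restated as ordered (primary, secondary) direction 
pairs.  The names and structure here are my own, purely illustrative -- 
Unicode itself records only a single horizontal bidi class per character:

```python
# The script table above, recast as (primary, secondary) direction pairs.
from enum import Enum

class Dir(Enum):
    LTR = "left-to-right"
    RTL = "right-to-left"
    TTB = "top-to-bottom"
    BTT = "bottom-to-top"

SCRIPT_DIRECTIONS = {
    "Latin":     [(Dir.LTR, Dir.TTB)],
    "Arabic":    [(Dir.RTL, Dir.TTB)],
    "Ogham":     [(Dir.LTR, Dir.TTB), (Dir.BTT, None)],
    "Han":       [(Dir.TTB, Dir.RTL), (Dir.LTR, Dir.TTB)],  # trad., modern
    "Mongolian": [(Dir.TTB, Dir.LTR)],
}

# Note how Latin and Mongolian share an LTR component yet differ in which
# axis takes priority -- the distinction that a single "L" class loses.
```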

Unicode characters have a default directionality, but both this and the 
override mechanism cover only the horizontal aspect, not the vertical aspect 
or the priority of one over the other.  Thus, Mongolian characters are 
assigned the same directionality code as Latin ("L") even though the TTB 
directionality takes precedence over the LTR, the opposite of Latin.  And 
there is no plain-text way to indicate the alternative directionality of 
Ogham or Han.

An elaboration of the directional override mechanism to handle vertical 
directionality would have to take priority into account as well.  Instead of 
two directionalities, LTR and RTL, the Unicode Standard would have to 
consider eight.  The Bidirectional Algorithm might have to become 
Octodirectional, with a commensurate increase in complexity.  Perhaps this is 
the problem that is avoided by declaring vertical directionality to be a 
higher-level "formatting style" issue.  But it still seems arbitrary.

-Doug Ewell
 Fullerton, California




Fact vs. fiction

2001-12-26 Thread DougEwell2

You know you're spending too much time thinking about Unicode when you hear 
about the new "Lord of the Rings" movie and your first thought is about 
Tengwar and Cirth, the scripts invented by Tolkien and encoded in the 
ConScript Unicode Registry 
 but not, at least yet, in 
Unicode proper.

If you have seen the movie or even the trailer, you may have noticed the 
Tengwar inscription on the ring.  I thought, "Uh oh, I wonder if I'm going to 
have to learn Tengwar now."  I never learned it due to the spookily abstract 
way it was described in the ConScript proposal:

"The Tengwar script is a system of consonantal signs without strictly fixed 
values; their glyphic structure comprises a matrix of potential phonetic 
relationships, rather than a set of fixed relationships between sound and 
character."

Whoa, that's deep.

Anyway, the main point of this post is to show that some folks still haven't 
grasped the distinction between fact (full-fledged Unicode) and fiction 
(ConScript).  If you visit the official Web site for the "Lord of the Rings" 
movie at  and click on "Tengwar Links," you 
will discover the following links:

> Unicode Standard: Cirth ConScript
> Unicode specifications of the Cirith [sic] script.
>
> Unicode Standard: Tengwar ConScript
> Unicode specifications of the Tengwar script.

So now a whole lot of Tolkien fans, old and new, are going to find these 
links to the CSUR pages for Tengwar and Cirth, without knowing the difference 
between Unicode and ConScript and without seeing the main CSUR page that 
explains this difference, and will conclude that these two Elvish scripts are 
officially encoded in Unicode.  Oh well.

-Doug Ewell
 Fullerton, California




Re: Vertical scripts (was: Tategaki (was: Re: Updated...))

2001-12-25 Thread DougEwell2

In a message dated 2001-12-25 16:57:39 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Unicode doesn't have some way to indicate vertical writing. I think the
> only consideration for it is vertical presentation forms of some
> characters. Anything more is left for other software layers to deal
> with.

Seeing that Unicode already has left-to-right and right-to-left override 
characters, I wonder if a top-to-bottom override character might also be 
reasonable.

-Doug Ewell
 Fullerton, California




CESU-8 marches on

2001-12-22 Thread DougEwell2

Without any fanfare, at least on this public mailing list, the proposed 
Unicode Technical Report #26 defining CESU-8 (Compatibility Encoding Scheme 
for UTF-16: 8-Bit) has been upgraded in the past week from "Proposed Draft" 
status to "Draft" status.  That means CESU-8 is moving forward along the road 
to approval by the UTC, however smooth or rocky that road may be.

So it seems like a sensible time to get back on my soapbox about CESU-8, ask 
the pivotal question once again concerning the motivation for this new 
scheme, and point out a lingering error in the TR while I'm at it.

CESU-8, for those who may have forgotten or repressed it, is a variation of 
UTF-8 which encodes supplementary characters in six bytes instead of four 
bytes.  Essentially, it is UTF-8 applied to UTF-16 code units instead of 
Unicode scalar values.  The UTF-16 transformation is applied to each 
supplementary character, breaking it into a high surrogate and a low 
surrogate, and then the UTF-8 transformation is applied to the two 
surrogates, so that each is encoded in three bytes.
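To make the double transformation concrete, here is a sketch of my own (based 
only on the description above, not on the text of the draft TR) of CESU-8 
encoding for a single supplementary character:

```python
# CESU-8 sketch: apply the UTF-16 transformation first (split a
# supplementary character into a surrogate pair), then UTF-8-encode each
# 16-bit surrogate as if it were a BMP character (3 bytes apiece).

def cesu8_encode_supplementary(cp: int) -> bytes:
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    hi = 0xD800 | (v >> 10)      # high surrogate
    lo = 0xDC00 | (v & 0x3FF)    # low surrogate
    def three_bytes(u: int) -> bytes:   # ordinary 3-byte UTF-8 pattern
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    return three_bytes(hi) + three_bytes(lo)

# U+10400 DESERET CAPITAL LETTER LONG I:
#   UTF-8  -> F0 90 90 80          (4 bytes, from the scalar value)
#   CESU-8 -> ED A0 81 ED B0 80    (6 bytes, via the surrogate pair)
```

The 6-byte sequence is exactly what a UTF-8 decoder would reject as a paired 
surrogate encoding, which is the source of the security concern described 
below.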

CESU-8 was originally called UTF-8S, at least on this list, the "S" 
presumably denoting the variant encoding of Surrogates.  It has been promoted 
by representatives of Oracle, notably Jianping Yang, and PeopleSoft, notably 
Toby Phipps (the author of DUTR #26), as a way to ensure that Unicode data is 
sorted consistently in UTF-16 code-point binary order.

Several people on this list, including me, have been critical of CESU-8, 
claiming that UTF-16 code-point order is not a suitable collation order and 
should not serve as the basis of a new (or hacked) UTF.  UTFs are supposed 
to be character encoding forms (cf. UTR #17, "Character Encoding Model") that 
map Unicode scalar values to sequences of bytes, words, double-words, etc.  
You're not supposed to piggyback a UTF on top of another UTF, the way CESU-8 
sits on top of UTF-16.

The critics of CESU-8 claim its reason for existence is that the database 
vendors have been ignoring the designation of the supplementary code space 
and have handled "Unicode" as surrogate-unaware UCS-2.  Now that 
supplementary characters have become a reality (as of Unicode 3.1), the 
vendors have chosen to promote this new encoding scheme instead of either (a) 
fixing the sort order of existing database engines to sort supplementary 
characters properly, AFTER basic characters, or (b) making a small 
modification to their sort routines to sort normal UTF-8 data in the 
idiosyncratic UCS-2-like order.
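The ordering difference at the heart of the dispute is easy to demonstrate. 
A quick Python sketch (the two code points are arbitrary examples chosen to 
straddle the BMP boundary):

```python
def cesu8(cp):
    # six-byte CESU-8 form of a supplementary code point
    v = cp - 0x10000
    out = b""
    for s in (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)):
        out += bytes([0xE0 | (s >> 12), 0x80 | ((s >> 6) & 0x3F),
                      0x80 | (s & 0x3F)])
    return out

# Does U+FFFD (last BMP character) sort before U+10000 (first supplementary)?
utf8_order  = "\uFFFD".encode("utf-8") < "\U00010000".encode("utf-8")
utf16_order = "\uFFFD".encode("utf-16-be") < "\U00010000".encode("utf-16-be")
cesu8_order = "\uFFFD".encode("utf-8") < cesu8(0x10000)
```

Binary UTF-8 order agrees with code-point order (U+FFFD first); binary UTF-16 
order, and hence CESU-8 order, puts the supplementary character first, because 
its surrogates begin at 0xD8.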

There is also a concern that CESU-8 is really just a variation of UTF-8, 
allowing (nay, requiring) sequences that are illegal in UTF-8 but otherwise 
looking just like UTF-8.  This could open security holes that the UTC has 
worked hard to close, and is continuing to close in Unicode 3.2.

Finally, although the promoters claim that this mutant form of UTF-8 is only 
for internal use within closed systems (which would make it completely 
unnecessary for the Unicode Consortium to sanction, describe, or even 
acknowledge it), they have not only written a Technical Report to describe it 
to the public but have announced their intent to register it with the IANA, a 
major step toward open interchange of CESU-8 data.  (It was claimed that the 
IANA registration was intended to pre-empt some other party from registering 
CESU-8 with IANA, but I don't see what difference this would make or how the 
pre-emptive action would help anything.)

The promoters of CESU-8 say that data in this format already exists in the 
real world, and the purpose in describing it in a UTR is to codify an 
existing de-facto standard.  For me, there is one question that gets to the 
real motivation behind CESU-8.  We know that basic (BMP) 
characters are encoded exactly the same in UTF-8 and CESU-8.  We also know 
that, although the supplementary space has been designated for many years, no 
actual supplementary characters (with the exception of private use planes 15 
and 16) were encoded, and thus allowed for interchange, until the publication 
of Unicode 3.1 earlier this year.

Furthermore, we know what characters are currently (Unicode 3.1) encoded in 
the supplementary space:  the ancient Old Italic and Gothic scripts; the 
Deseret script, which has not been actively promoted for 130 years; a large 
set of musical and mathematical symbols; the Plane 14 language tags; and 
several thousand Han characters.  The Han characters are generally thought to 
be less commonly used than those in the BMP; otherwise (so the story goes) 
they would have been encoded in Unicode sooner.  Remember that none of these 
non-BMP characters could be conformantly used (e.g. stored in a database) 
until the publication of Unicode 3.1.

So my question is:  What supplementary characters are currently, TODAY, 
stored in Oracle or PeopleSoft databases that require the creation of a new 
encoding scheme to ensure they can continue to be sorted consistently?

I s

[OT] Re: The virus

2001-12-12 Thread DougEwell2

In a message dated 2001-12-12 1:55:07 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Any operating system worthy of the name wouldn't process any
> kind of virus and ISPs should live up to their obligation of
> Providing Service and bounce any message containing a virus.

A virus that would be processed *by the operating system* is just an 
executable program.  It would be extremely difficult -- "impossible" is a 
word we all try to avoid -- to detect, before running a program, whether it 
replicates itself and/or causes harmful side effects.  That is the job of 
anti-virus software.  And Microsoft is now finding itself in legal trouble 
for bundling "extras" like anti-virus software into their operating systems 
and thereby driving specialized companies out of business.

Both of the systems I use run Microsoft OS's (Windows 95 (!) at home and 2000 
at work), and yet I almost never get affected by these viruses.  At home I 
use CompuServe 5.0, and at work I use Lotus cc:Mail 8.5.  Neither of these 
e-mail clients will download, let alone execute, any attachment (including a 
virus) without user intervention.  The gatekeeper for e-mail viruses is the 
e-mail client, not the operating system.

As Clive Hohberger said, Microsoft Outlook is the preferred target for e-mail 
viruses, not only because it is the "biggest and most symbolic target around" 
but also because it is the closest thing there is to a "standard" e-mail 
client.  Imagine writing an e-mail virus that sent itself to everyone in your 
CompuServe address book, or your cc:Mail address book.  That would be like 
crashing an airplane into a dollhouse.

-Doug Ewell
 Fullerton, California




Re: Are these characters encoded?

2001-12-04 Thread DougEwell2

In a message dated 2001-12-04 2:48:55 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> The overbar being a flat form of tilde, which in medieval hands were 
> used to indicate an omitted m or n following.

Ah.  So it is "cum" after all.  Thank you.

-Doug Ewell
 Fullerton, California




Re: Are these characters encoded?

2001-12-03 Thread DougEwell2

In a message dated 2001-12-03 12:20:46 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

>  Perhaps a corruption of "c-overbar," which is a medical abbreviation for
>  "with," sometimes used by nurses, doctors, and pharmacies?

Thanks to everyone who, directly or indirectly, corrected me on this 
character.  Yes, you are all right: the character used in (as it turns out) 
the medical field to mean "with" is, in fact, c-overbar and not c-underbar.  
In Unicode we would say U+0063 U+0305.

So to get back to my original questions about this thing, (a) is it a 
character in its own right, (b) if so, is there any justification in encoding 
it separately rather than using a combining sequence, and (c) is this not 
*exactly* the same set of issues as the question of encoding the Swedish 
o-underbar?

-Doug Ewell
 Fullerton, California




Unicode 1.0 names for control characters

2001-12-03 Thread DougEwell2

I am surprised and puzzled by the "Unicode 1.0 Name" changes for some of the 
ASCII and Latin-1 control characters that were introduced in the latest beta 
version of the Unicode 3.2 data file (UnicodeData-3.2.0d5.txt):

U+0009  HORIZONTAL TABULATION  ==>  CHARACTER TABULATION
U+000B  VERTICAL TABULATION  ==>  LINE TABULATION
U+001C  FILE SEPARATOR  ==>  INFORMATION SEPARATOR FOUR
U+001D  GROUP SEPARATOR  ==>  INFORMATION SEPARATOR THREE
U+001E  RECORD SEPARATOR  ==>  INFORMATION SEPARATOR TWO
U+001F  UNIT SEPARATOR  ==>  INFORMATION SEPARATOR ONE
U+008B  PARTIAL LINE DOWN  ==>  PARTIAL LINE FORWARD
U+008C  PARTIAL LINE UP  ==>  PARTIAL LINE BACKWARD

Were these "new" names (e.g. CHARACTER TABULATION) really the original 
Unicode 1.0 names?  I don't have my 1.0 book close at hand, but I know that 
they were *not* the names used in 1.1, according to the file "namesall.lst" 
from that version.  (Aha, didn't think anyone still had that dusty old thing 
lying around?)

IMHO, the new names CHARACTER TABULATION and LINE TABULATION are much less 
intuitive than HORIZONTAL TABULATION and VERTICAL TABULATION.  Sometimes you 
even see the abbreviations HT and VT for these two characters.  The new names 
appear to have been invented by someone who imagined a lack of clarity in the 
old names.

I have seen the names IS4, IS3, IS2, and IS1 before, but they do not convey 
the same information as FS, GS, RS, and US.  The latter names are more 
specific.

The "old" names for these six control characters were used as far back as the 
original 1963 version of ASCII, according to Mackenzie (pp. 245-247).

I don't know about the history of U+008B and U+008C, but again it seems 
strange that the "Unicode 1.0 name" for these characters is being changed at 
this late date.

I know this 1.0 name field is not subject to the same rule of "no changes, 
ever" that applies to the regular Character Name field, but why should these 
names be changed at all?

On this same topic, parenthesized abbreviations have been added to the 1.0 
names for U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE 
RETURN (CR), and U+0085 NEXT LINE (NEL).  Does the addition of these 
abbreviations mean that they are now part of the official 1.0 name, and if 
so, why?  Other characters typically don't have abbreviations as part of 
their names, even if they are as meaningful and as commonly used as these, 
and again it is a change from the 1.0 name we have seen for a decade.

Perhaps I've been checking the beta files a bit TOO carefully.

-Doug Ewell
 Fullerton, California




Re: Are these characters encoded?

2001-12-02 Thread DougEwell2

In a message dated 2001-12-02 11:00:32 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> "o." and "o-with-underscore" are NOT glyph variants of a ligature of 
> e and t (at a character level), no matter what they mean.

I suggested that Stefan's o-underscore "and" might OR might not be a 
variation of the ampersand, in all its many existing glyph variants.

The "glyph variant" side is bolstered by the argument that it's a symbol, 
just like &, used to mean "and" without any translation necessarily taking 
place; that it's only used in Swedish; and that users consider it equivalent 
to & and use different forms depending on whether the text is handwritten or 
typed.

The "separate character" side can point to the fact that its derivation is 
completely different from that of &; that it looks nothing like any of the 
existing forms of & (like TIRONIAN SIGN ET); and that it's only used in 
Swedish (cf. GREEK QUESTION MARK).

I don't think there is one obvious answer to this.  I will say this, however: 
The majority of posts stating that some character or other is "not in 
Unicode" turn out to be bogus; the proposed character is really a glyph 
variant or presentation form.  Stefan's original post had the following three 
points:

1.  Swedish "o-underscore" -- maybe, maybe not
2.  Fraction slash -- already encoded
3.  Roman numerals -- overextension of compatibility forms; rendering issue

When two of three proposals can be quickly blown off, it is human nature that 
sometimes it is difficult to see the potential virtue in the third.

I also want to say that, although Michael is of course correct that & was 
originally a ligature of e and t, many, many of the & glyphs seen today do 
not even remotely resemble such a ligature.  Consider the top three glyphs in 
the attached GIF (only 290 bytes).  The first is obviously still an e-t 
ligature, the second is one with centuries of typographical evolution applied 
to it (and today more closely resembles a treble clef), the third is not at 
all.  If traceability to the original Latin "et" were what made these 
characters the same or different, then that might have spoken against the 
separate encoding of TIRONIAN SIGN ET.

I never think of & as meaning "et," even the glyph variants that do look like 
an e-t ligature.  I assume that practically all users of this symbol treat it 
as a logograph meaning "and" in the language of the surrounding text.  (I 
have, rarely, seen & used in Spanish text, which strikes me as funny since 
the Spanish words for "and" ("y" and "e") would not seem to need 
abbreviating.)

So the question might be posed, do Swedish users think of o-underscore as a 
logograph meaning "och" or as an abbreviation for the spelled-out word "och"?

In a message dated 2001-12-02 9:23:51 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>>> Having said that, it seems to me that U+00B0 would represent Stefan's
>>> character easily enough.
>>
>> No. It's not a degree sign.  Nor is 00BA appropriate: the underlined o is
>> not superscripted/raised (much, if at all).
>
> Sorry, I did mean U+00BA, and subscription or superscription of the 
> glyph in that character is a matter of glyph choice.

I think, though, that use of U+00BA MASCULINE ORDINAL INDICATOR would be a 
classic example of hijacking a character for an unintended and inappropriate 
purpose simply because its glyph looks "close enough."  This would be like 
using U+003B at the end of a Greek question.  I stick to my original 
suggestion of U+006F U+0332, crossing my fingers that rendering engines will 
handle this correctly.

-Doug Ewell
 Fullerton, California




Re: Are these characters encoded?

2001-12-01 Thread DougEwell2

At 2001-12-01 11:24:04 Pacific Standard Time, 
[EMAIL PROTECTED] (Stefan Persson) wrote:

> I was thinking if this was encoded:
>
> 1.) Swedish ampersand (see "&.bmp"). It's an "o" (for "och", i.e. "and")
> with a line below. In handwritten text it is almost always used instead of
> &, in machine-written text I don't think I've ever seen it.

This might be a character in its own right, as different from the ampersand 
as U+204A TIRONIAN SIGN ET.  Or it might be simply a glyph variant of  the 
ampersand.  If you have never seen o-underbar in machine-written text, I 
doubt that this will help your cause much.  You might try U+006F U+0332, 
though this will probably not give you the vertical spacing you expect.

(As a side note, this "o-underbar" form reminds me of the "c-underbar" which 
is sometimes used in handwritten English to mean "with."  Does anyone know 
the origin of this symbol?  Is it possibly derived from the Latin word cum, 
meaning "with"?  Does it have any claim to being a character in its own 
right?)

> 2.) Fractions with any number, see "bråk.bmp."

U+2044 FRACTION SLASH is exactly what you are looking for.  Whether your 
browser or other rendering engine will display it the way you want is another 
matter.

On page 154 of TUS 3.0, there is a two-paragraph description of the use of 
U+2044.  Note particularly the sentence:

"The standard form of a fraction built using the fraction slash is defined as 
follows: Any sequence of one or more decimal digits, followed by the fraction 
slash, followed by any sequence of one or more decimal digits."

This would give you the results you expect for "123/456" but not for "x/y" or 
even "14658.48/13789".  However, it is not clear to me that this "standard 
form" is normative, and it is conceivable that a fraction-slash-aware 
renderer could generalize this to "one or more non-space characters, fraction 
slash, one or more non-space characters."
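The two readings differ only in what is allowed on either side of U+2044.  As 
a sketch in Python regular expressions (the "generalized" pattern is the 
hypothetical extension discussed above, not anything the standard defines):

```python
import re

# Standard form per TUS 3.0: decimal digits, fraction slash, decimal digits.
standard_form = re.compile(r"\d+\u2044\d+")

# Hypothetical generalization: any non-space runs on either side.
generalized = re.compile(r"\S+\u2044\S+")
```

"123⁄456" matches both patterns; "x⁄y" matches only the generalized one.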

> 3.) Roman numerals. I know I-XII are encoded, but what if you want to use
> higher numbers? Typing "XX," you might suggest.

The set of Roman numerals, at least through 4999, can be completely specified 
with the characters U+2160 "I", U+2164 "V", U+2169 "X", U+216C "L", U+216D 
"C", U+216E "D", and U+216F "M" (or, of course, with the equivalent Latin 
letters).  According to TUS 3.0, page 299, "Upper- and lowercase variants of 
the Roman numerals through 12, plus L, C, D, and M, have been encoded for 
compatibility with East Asian standards."  Requests for additional 
precomposed Roman numerals will almost certainly be denied.

> This is not always
> sufficient; in Sweden we often put a line under and one above the numbers,
> see "Roma.bmp."

Sounds like a glyph-variant issue.  Font designers might want to ensure that 
the glyphs for the Roman numeral forms do have the over- and underlines.  
Then, if a user doesn't want them, she can always use the plain Latin letters 
instead.

> And what about ten thousands? Neither "X¯" nor "X¯" are
> displayed properly!

They should be; that's what the combining characters are there for.  (Hint: 
you want U+0305 COMBINING OVERLINE, not U+0304 COMBINING MACRON.)

To be fair to Stefan, most rendering engines have a long way to go to catch 
up with the Unicode ideal of being able to attach arbitrary combining marks 
(like U+0305) to arbitrary base characters (like U+2169).  Many renderers 
simply replace the sequence with a precomposed glyph.  This approach looks 
really sharp IF such a glyph is available, but breaks down otherwise.
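For the record, the ten-thousands form reduces to a plain two-character 
combining sequence; a quick Python check of the intended code points:

```python
import unicodedata

# Roman numeral ten with an overline: U+2169 followed by U+0305
ten_thousand = "\u2169\u0305"
names = [unicodedata.name(c) for c in ten_thousand]
# note: COMBINING OVERLINE, not COMBINING MACRON (U+0304)
```

Whether a renderer stacks the overline correctly over the numeral is, as noted 
above, another matter entirely.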

-Doug Ewell
 Fullerton, California




Re: Call for Papers - 21st Unicode Conference - May 2002 - Dublin

2001-11-30 Thread DougEwell2

In a message dated 2001-11-30 11:24:57 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Submissions should be sent by e-mail to either of the following
> addresses:
...
> They should use ASCII, non-compressed text and the following subject
> line:

This ASCII-only requirement for submitting papers for a Unicode Conference 
continues to amuse me.  I should think that UTF-8 at least would be 
acceptable.

-Doug Ewell
 Fullerton, California




Re: Comments on FCD 5218, "Codes for the representation of human sexes"

2001-11-29 Thread DougEwell2

In a message dated 2001-11-29 8:58:42 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

> 
>  Please spread the word. My French colleagues are frustrated and embarrassed 
>  by the continued propagation of this unfortunate myth.
>  


-Doug Ewell
 Fullerton, California




French uppercase accented letters (was: Re: Comments on FCD 5218)

2001-11-29 Thread DougEwell2

Sorry about the previous message.  I hit "Send Now" by accident while trying 
to select some text.

In a message dated 2001-11-29 8:58:42 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

>  This in turn led to the myth that the French do not use 
>  uppercase accented letters...
>
>  Please spread the word. My French colleagues are frustrated and embarrassed 
>  by the continued propagation of this unfortunate myth.

Unfortunately, I have seen this myth written in some relatively authoritative 
sources, including some from France.  For my part I am glad that it is not 
true, as it would have created yet one more annoying difference between 
French French and Canadian French that did not have to exist.

-Doug Ewell
 Fullerton, California




E-mail addresses for Viet-Std and TriChlor

2001-11-28 Thread DougEwell2

Does anyone know the current e-mail addresses for either the Vietnamese 
Standardization Group (Viet-Std) or their reputedly non-profit spinoff, the 
TriChlor Group?  The commonly referenced addresses based at 
haydn.stanford.edu seem to be non-functional.

Viet-Std is responsible for codifying the VISCII (8-bit) and VIQR (7-bit) 
encodings for Vietnamese.  TriChlor has written several software products 
that use these encodings.  I am in the process of writing converters between 
VIQR and Unicode and would like to contact these people concerning the fine 
details of these two encodings.

-Doug Ewell
 Fullerton, California




Error in Roadmap page

2001-11-27 Thread DougEwell2

The page "Roadmap to the SSP," located on the Unicode Web site at 
http://www.unicode.org/roadmaps/ssp-3-0.html, contains the following curious 
passage:

> Plane 14 is tentatively mapped out to the following zones.
>
> 000E0000-000E007F Tag characters
> 000E00A0-000E00FF unassigned
> 000E0180-000E01FF Variation Selectors
> 000E0080-000EFFFD unassigned

So far I have been unable to convince my brain that these ranges are correct.
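A quick check bears out the suspicion.  Taking the first range to start at 
000E0000, the last range as printed overlaps both of the two ranges before it 
(a throwaway sketch, using the values from the page):

```python
# The four zones as printed on the Roadmap page.
ranges = [
    (0x0E0000, 0x0E007F, "Tag characters"),
    (0x0E00A0, 0x0E00FF, "unassigned"),
    (0x0E0180, 0x0E01FF, "Variation Selectors"),
    (0x0E0080, 0x0EFFFD, "unassigned"),
]

# Find every pair of zones whose ranges intersect.
overlaps = [(a[2], b[2]) for i, a in enumerate(ranges) for b in ranges[i + 1:]
            if a[0] <= b[1] and b[0] <= a[1]]
```

The final "unassigned" zone swallows both the second zone and the Variation 
Selectors zone, which is presumably not what was intended.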

-Doug Ewell
 Fullerton, California




Re: What is meant by '\u0000'?

2001-11-27 Thread DougEwell2

In a message dated 2001-11-26 10:42:17 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>> As for cut & paste, it might work among Microsoft Apps
>> but if one  wants to interface an app  with a disclosed
>> clipboard  format he will realize that he can not paste
>> unicode text that  contains '\u0000'  characters. Impossible.
>
> Does he mean specifically the character U+0000, or rather any character 
> referenced by hex codepoint?

Hmmm.  I hadn't looked at it that way.  No, I don't suppose you could paste 
the six ASCII characters "\u0410" into a Word document and get a Cyrillic A.  
But it would be a real stretch to claim that Microsoft's handling of Unicode 
is broken or idiosyncratic simply because their apps don't support this 
special notation.

> It would be useful to have a utility where you type text and out come the
> '\uXXXX' type strings (or else HTML hash codes) for use in a Java program or 
> Web page.

SC UniPad can convert Unicode text to and from both Java-style \uXXXX 
notation and HTML/SGML entities (either hex or decimal).  Visit 
http://www.unipad.org for a free download.
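The core of such a utility is simple.  A rough Python sketch (BMP characters 
only; Java escapes for supplementary characters would need surrogate pairs, 
and the function names here are mine, not UniPad's):

```python
def to_java_escapes(text):
    """Replace each non-ASCII BMP character with a Java-style \\uXXXX escape."""
    return "".join(ch if ord(ch) < 0x80 else "\\u%04X" % ord(ch)
                   for ch in text)

def to_html_entities(text, hex_form=True):
    """Replace each non-ASCII character with an HTML numeric character
    reference, hexadecimal (&#x...;) or decimal (&#...;)."""
    fmt = "&#x%X;" if hex_form else "&#%d;"
    return "".join(ch if ord(ch) < 0x80 else fmt % ord(ch) for ch in text)
```

For example, Cyrillic A (U+0410) becomes "\u0410", "&#x410;", or "&#1040;" 
depending on the target notation.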

-Doug Ewell
 Fullerton, California




Re: The real solution

2001-11-25 Thread DougEwell2

In a message dated 2001-11-25 21:52:58 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  As for cut & paste, it might work among Microsoft Apps
>  but if one  wants to interface an app  with a disclosed
>  clipboard  format he will realize that he can not paste
>  unicode text that  contains '\u0000'  characters. Impossible.

How much text exists in the real world that legitimately contains U+0000 
characters that need to be cut and pasted?  What is the meaning or 
significance of these characters?  If Microsoft apps do not allow U+0000 to 
be cut and/or pasted, which I didn't know, this is probably non-conformant 
but does not seriously break the apps' Unicode support or show Unicode to be 
a non-interoperable standard.

>  And how about UCS-4 ? Forget it. As a text format it is not
>  even  existent.

"Not supported by most current software" would be more correct and more fair. 
 Most current Unicode-enabled software was not designed with supplementary 
characters in mind, and if you dismiss supplementary characters there is 
little point in supporting UCS-4 (UTF-32).

>  I think it would be much better to look for another
>  benchmark engine. If I were Unicode Consortium I would
>  build one. Just to  prove that the  standard works.
>  Wait... maybe it does not?

Wonder if any fish will bite at this flame bait...

-Doug Ewell
 Fullerton, California




Re: The real solution

2001-11-25 Thread DougEwell2

In a message dated 2001-11-25 10:19:59 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> This time i will raise the old issue with a new perspective and with more 
> practicality. The Unicode encoding is meant to help people around the world 
> to use characters in their own language besides English. But, unfortunately 
> this is not the case for Hindi, the third largest spoken language of the 
> world (spoken in around 10 countries). This is so because the Unicode 
> encoding for the Devnagari script has failed to do just this. The script 
> that is used for writing this language, i.e. Hindi and around 8 other 
> languages. 

Unfortunately, Arjun has never presented any evidence that Unicode does not 
properly support the Devanagari script or any languages written in it, nor 
that he understands the Unicode Standard or the character-glyph model enough 
to see why it does.  All he has shown is that specific glyphs for half-forms 
are not separately encoded, which we already know and which is a moot point.

> Is this some kind of conspiracy to keep the use of Indic scripts from the 
> Unicode system to the minimal.

Normally a "conspiracy theory" statement of this sort is grounds for 
immediate dismissal of the author, his message and any subsequent messages on 
the topic.  However, in hopes that Arjun can be persuaded to consider the 
facts of  Unicode, I will continue to address his points.

> This is because the Unicode system does not 
> provide means and ways to display and even more importantly store characters 
> for Devnagari in the way they should be (and the way in which they are used).

There is always room for different interpretations, in character encoding and 
other aspects of life, of how things "should be."  While it is almost 
certainly true that neither ISCII nor Unicode is a perfect system -- 
perfection, after all, is hard to find these days -- they are both fully 
capable of supporting Devanagari, given sufficient rendering technology.

> The people want a script system that they can use for every purpose 
> (displaying, encoding new fonts, databases and every other purpose under the 
> sun) and not just for displaying characters (which unfortunately is left to 
> the mercy of the OS manufacturers), even this function of it being not used 
> properly.

Unicode is NOT primarily a display or font technology.  To claim that it is 
merely reinforces that Arjun has not bothered to learn anything about Unicode.

> If anybody wants to see how the Devnagari encoding of Unicode should 
> actually look like, they can visit http://www.bharatbhasha.com and download 
> a font named Shusha. If they are not able to do this they can send me a 
> private e-mail at [EMAIL PROTECTED] and i will send them the font file 
> for Windows in an attachment.
> The above mentioned font has not been developed by me and 
> therefore should not be confused as a promotion through this forum.

Custom fonts and font switching are NOT the way to achieve interoperable 
encoding.  Someone who professes an interest in database storage should be 
especially aware of this.  The Vietnamese, who have relied for years on 
specialized fonts, are moving away from them and towards Unicode, and are 
experiencing significant improvements in interoperability.

Please visit www.unicode.org and read at least some of the Unicode Standard 
and Technical Reports, and then come back with specific questions or concerns 
about Devanagari that reference the standard or show some knowledge of it.
Please do not repeat the blanket claim that Unicode is inadequate to support 
Hindi (et al.) because half forms are not separately encoded.

-Doug Ewell
 Fullerton, California




Re: Unicode surrogates in browsers for the compelling demo

2001-11-18 Thread DougEwell2

In a message dated 2001-11-18 19:49:08 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  Surely, the browser's lack of ability in being able to handle
>  UTF-8 for Plane One, while handling UTF-8 for Plane Two
>  just fine (on W2K), is a bug that should be fixed.  It also
>  illustrates that non-BMP support is still rather new, still
>  under testing, and still being developed.  Hopefully, future
>  updates of MSIE/Uniscribe will resolve these issues.

The time is NOW for all manufacturers of Unicode-enabled software, Microsoft 
and everyone else, to test their products with supplementary characters or 
code points and fix any problems.  Supplementary characters are part of 
Unicode TODAY.

Delaying this support any further will only cause headaches for the CJK users 
who can't view or print their personal name logographs encoded in the SIP, 
and who will find some way to blame Unicode for this.

-Doug Ewell
 Fullerton, California




Re: input method descriptions

2001-11-12 Thread DougEwell2

Peter Constable wrote:

>  Perhaps it might be possible to resolve 
>  this by simply stipulating that the left shift key is position B00 (say), 
>  though we may still run afoul of the next two issues.

It's not, actually.  Check Part 2, section 8.3.1:

All or part of the left-hand level 2 select key shall be in position B99.

(Remember that "level 2" is ISO 9995 terminology for Shift.)

Some keyboards (not mine) do have a B00 key that falls between the left shift 
key and position B01 (the 'Z' key on my QWERTY keyboard).  Backslash used to 
be placed here on some early PC keyboards.

The fact that not all physical layouts are identical creates some interesting 
problems, as Peter mentioned.  For example, using SC UniPad I can create a 
keyboard that assigns a character to the B00 key, but I can't type it since 
my keyboard has no such key.  I have to use the on-screen virtual keyboard 
(possibly in conjunction with Shift or Ctrl+Alt) to access it.  I imagine 
users of Keyman have encountered similar situations.

Peter already covered everything else I wanted to say about logical vs. 
physical keyboard layouts.

-Doug Ewell
 Fullerton, California




Re: ______ input methods?

2001-11-12 Thread DougEwell2

In a message dated 2001-11-08 7:19:19 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  Any freely available info on ISO 9995, or do I have to pay through the 
>  nose for the privilege of learning about it?

Very good question.  I searched the Web numerous times and finally found some 
*draft* sections (not the entire standard) on a site run, maintained, or at 
least populated with documents by, Alain LaBonté.  I don't know whether I 
was supposed to be forbidden from seeing/printing those documents and Google 
simply led me astray, or whether they were openly available.

I'll see if I can find that site again and send you the URL.

Of course, Peter is spot on.  Naturally we all understand that standards 
organizations need to be funded, but all the same, it is ludicrous that in 
order to conform properly to an international standard (which exists to 
promote conformity), one must purchase it at an exorbitant price (USD 75 is 
typical).  I don't have a solution, at least not today, but the world 
desperately needs one.

Come to think of it, is there an ECMA equivalent to ISO 9995 (which would 
thus be available for free or almost free)?

-Doug Ewell
 Fullerton, California




Re: Thank you for all the good information, sUTF32ToUTF8 function

2001-11-08 Thread DougEwell2

In a message dated 2001-11-08 21:09:35 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  Anybody willing to check this for me?
>
>  CString sUTF32ToUTF8( LONG lUTF32 )

I haven't run it through a compiler, but most of it looks fine.  However, the 
algorithm would be a lot more transparent if the constants were hex instead 
of decimal (e.g. 0x1000 instead of 4096).

Also, I would have written it to use bit shifts instead of divisions and 
modulos (lUTF32 >> 12 instead of lUTF32 / 4096).

And I don't think you're supposed to exclude the surrogate code space (0xD800 
through 0xDFFF) from normal processing.  (This is the "D29 conundrum" -- all 
UTFs must support encoding of non-characters, including unpaired surrogates, 
even though UTF-16 cannot do this.)  The code you provided encodes unpaired 
surrogates in four bytes -- by pushing them down to the final "else" -- which 
is wrong in any event and almost certainly not what the programmer intended.
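To make the suggestions concrete, here is a sketch of the conversion using 
hex constants and bit shifts.  (This is my illustration in Python, not the 
original C++; note that surrogate code points simply fall through the 
three-byte branch here, sidestepping the D29 question rather than answering 
it.)

```python
def utf32_to_utf8(cp):
    """Convert one code point to UTF-8 bytes, using shifts and hex masks."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])                              # 1 byte: ASCII
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),                 # 2 bytes
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),                # 3 bytes (BMP,
                      0x80 | ((cp >> 6) & 0x3F),        # incl. surrogates)
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),                # 4 bytes
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")
```

With the boundaries written in hex (0x80, 0x800, 0x10000), each branch can be 
checked against the UTF-8 bit patterns at a glance.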

-Doug Ewell
 Fullerton, California




Re: How to print the byte representation of a wchar_t string with non -ASCII ...

2001-11-01 Thread DougEwell2

In a message dated 2001-11-01 12:23:58 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  > But won't this approach fail as soon as we hit a 0x00 byte (i.e. the
>  > high 8 bits of any Latin-1 character)?
>
> I'm not sure what you're alluding to here. As long as
>  all characters in wstr belong to the repertoire of the encoding/
>  character set of the current locale (that is, unless one
>  passes wstr containing Chinese characters to printf() in,
>  say, de_DE.ISO8859-1 locale),
>  there should not be any problem with using '%ls' to
>  print out wstr with printf(). Of course, 'printf ("%ls", wstr) '
>  doesn't achieve what the original question asked for, but that
>  question has already been answered, hasn't it?

OK, I freely admit that I didn't know what I was talking about and should not 
have gotten involved in this question.

William Tay had originally asked:

>  Say in C, I have wchar_t wstr[10] = L"frán";
>  Is there any printf or wchar equivalent function (using appropriate format
>  template) that prints out the string as 
>  66 72 C3 A1 6E in en_US.UTF-8 locale under UNIX?

and I doubted (and am still surprised) that something like this would be part 
of the standard library.  Doesn't seem very 'C' to me.  But apparently it is, 
so William is in luck and I have learned something new.

-Doug Ewell
 Fullerton, California




Re: Worst case scenarios on SCSU

2001-11-01 Thread DougEwell2

In a message dated 2001-10-31 15:54:34 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  Has any one done worst case scenarios on SCSU, with respect to other
>  methods of encoding Unicode characters?

In addition to theoretical worst-case scenarios, it might also be worthwhile 
to consider the practical limitations of certain encoders.  SCSU does not 
require encoders to be able to utilize the entire syntax of SCSU, so in the 
extreme case, a maximally stupid SCSU "compressor" could simply quote every 
character as Unicode:

SQU hi-byte lo-byte SQU hi-byte lo-byte ...

This would result in a uniform 50% expansion over UTF-16, which is pretty bad.
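Such a maximally stupid compressor is a few lines of code.  A sketch, with the SQU tag value (0x0E) taken from UTS #6:

```c
#include <stddef.h>

/* Maximally naive SCSU "compressor": quote every BMP code unit with SQU.
   Output is 3 bytes per UTF-16 code unit -- a uniform 50% expansion. */
size_t naive_scsu(const unsigned short *utf16, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        out[o++] = 0x0E;                              /* SQU: quote one Unicode character */
        out[o++] = (unsigned char)(utf16[i] >> 8);    /* hi byte */
        out[o++] = (unsigned char)(utf16[i] & 0xFF);  /* lo byte */
    }
    return o;
}
```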

On a more realistic level, even "good" SCSU encoders are not required by the 
specification to be infinitely intelligent and clever in their encoding.  For 
example, I think my encoder is pretty decent, but it encodes the Japanese 
example in UTS #6 in 180 bytes rather than the 178 bytes illustrated in the 
report.  This is because the Japanese data contains a couple of sequences of 
the form <uncompressible> <compressible> <uncompressible>, where the outer 
characters are not compressible and the middle one is.  If there is only one 
compressible character between the two uncompressible ones, as in this case, 
it is more efficient to just stay in Unicode mode for that character rather 
than switching modes.  My encoder isn't currently bright enough to figure 
this out.  In the worst case, then, a long-enough alternating sequence of 
uncompressible and compressible characters would take 5 bytes for every 2 BMP 
characters.

-Doug Ewell
 Fullerton, California




Re: Worst case scenarios on SCSU

2001-10-31 Thread DougEwell2

It must be a full moon on Halloween, because here I am in the extremely 
unfamiliar position of disagreeing quite strongly with Ken Whistler.

In a message dated 2001-10-31 17:16:25 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

>  As current Czar of Names Rectification, I must start protesting
>  here. SCSU is a means of *compressing* Unicode text. It is
>  not "[an]other method of encoding Unicode characters."

I was about to reply, "Of course it is," before I realized that Ken was 
interpreting the word "encoding" in the strictest sense, invoking the 
distinction between character encoding forms (CEFs) and transfer encoding 
syntaxes (TESs).  In some cases this is a worthwhile distinction, but I don't 
think it is relevant in the case of David's query, or, for that matter, in 
many other cases where users may think of Unicode text being "represented" as 
UTF-32, UTF-16, UTF-8, SCSU, ASCII with UCN sequences, or even (God forbid) 
CESU-8.

SCSU is indeed another method of "representing" Unicode characters, if not 
necessarily "encoding" them in the strict sense of the word.

>  And before going on, I'm not clear exactly what you are
>  trying to do. SCSU is defined on UTF-16 text. It would, of
>  course, be possible to create SCSU-like windowing compression
>  schemes that would work on UTF-32 or UTF-8 text, but those are
>  not part of UTS #6 as it is currently written.

Like David, I don't see how SCSU is defined on, or limited to, UTF-16 text, 
except in the sense that literal or quoted "Unicode-mode" SCSU text is 
UTF-16.  SCSU is defined on Unicode scalar values, which are not tied to a 
particular CEF.

You can define a window in what SCSU calls "the expansion space" using the 
SDX or UDX tag and, in the best case, store N characters of Gothic or Deseret 
text in N + 3 bytes.  None of this has anything to do with surrogates or 
16-bitness.
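A sketch of that best case, under my reading of UTS #6 (SDX tag value 0x0B; extended window offset = 0x10000 + 0x80 * index); the text is assumed to be confined to one 128-character window of Deseret:

```c
#include <stddef.h>

/* Sketch: encode text confined to U+10400..U+1047F (the first half of
   Deseret) in N+3 bytes, using SDX to define and select dynamic window 0
   in the expansion space.  Offset = 0x10000 + 0x80 * index (per UTS #6). */
size_t scsu_deseret(const unsigned long *cps, size_t n, unsigned char *out)
{
    unsigned int w = (0x10400 - 0x10000) / 0x80;  /* 13-bit window index = 8 */
    size_t o = 0;
    out[o++] = 0x0B;                     /* SDX: define extended dynamic window */
    out[o++] = (unsigned char)(w >> 8);  /* window 0 in high 3 bits, index high bits */
    out[o++] = (unsigned char)(w & 0xFF);
    for (size_t i = 0; i < n; i++)
        out[o++] = (unsigned char)(0x80 + (cps[i] - 0x10400));
    return o;                            /* n + 3 bytes total */
}
```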

In a message dated 2001-10-31 17:59:33 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

>  I have no quarrel with the claim that the SCSU scheme could be
>  implemented directly on UTF-32 data. But as Unicode Technical Standard
>  #6 is currently written, that is not how to do it conformantly.

I have looked throughout UTS #6 and cannot find anything, explicit or 
implicit, to the effect that SCSU could not be conformantly implemented 
against UTF-32 data.  Sections 6.1.3 and 8.1 refer to how "surrogate pairs" 
may be encoded (*) in SCSU, but if you substitute the phrase "non-BMP 
characters" the meaning is identical.

(*) The word "encoded" was taken directly from UTS #6, section 8.1.

>  At the moment, if you want to compare SCSU-compressed text
>  against the UTF-32 form, you would have to convert the UTF-32
>  text to UTF-16, and then compress it using SCSU. You don't
>  apply SCSU directly to UTF-32 data.

Why not?  The fact that UTS #6 was originally written before UTF-32 was 
formally defined has nothing to do with this.  The same could be said for 
UTF-8, which (like SCSU) has a surrogate-free mechanism for representing 
non-BMP characters.

>  It seems to me that a rewrite of SCSU would be in order to explicitly
>  allow and define UTF-32 implementations as well as UTF-16 implementations
>  of SCSU.

I don't see anything that needs rewriting.  What are you seeing?

-Doug Ewell
 Fullerton, California




Re: How to print the byte representation of a wchar_t string with non-ASCII ...

2001-10-31 Thread DougEwell2

In a message dated 2001-10-31 10:07:44 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  This is wrong.  wchar_t strings can of course be printed.  Reading the
>  ISO C standard would tell you to use
>
>printf ("%ls", wstr);
>
>  can be used to print wchar_t strings which are converted to a byte
>  stream according to the currently selected locale.  Eventually it has
>  to be wprintf() if the stdout stream is wide oriented.  Read the
>  standard.

Oops, sorry.  I hadn't read the standard, and so I didn't know that.  
(Probably shouldn't have answered William's question in that case.)

But won't this approach fail as soon as we hit a 0x00 byte (i.e. the high 8 
bits of any Latin-1 character)?

Also, Addison's response is correct that wchar_t is not going to be natively 
UTF-8.

-Doug Ewell
 Fullerton, California




Re: How to print the byte representation of a wchar_t string with non-ASCII ...

2001-10-31 Thread DougEwell2

In a message dated 2001-10-31 7:49:44 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  For debugging purpose, I'd like to find out how I can print the byte
>  representation of a wchar_t string. 
>
>  Say in C, I have wchar_t wstr[10] = L"frán";
>  Is there any printf or wchar equivalent function (using appropriate format
>  template) that prints out the string as 
>  66 72 C3 A1 6E in en_US.UTF-8 locale under UNIX?

There's nothing built in, since wchar_t is intended to be a more abstract 
data type than that.

If you know the size of a wchar_t on your system, you can create a union 
containing a wchar_t and as many (1-byte) chars as you need.  For example, if 
a wchar_t occupies 2 bytes on your system, try:

union something
{
    wchar_t w;
    char c[2];
};

Then copy each wchar_t in the wide string into the union's w member, and 
print the values of the plain char members.
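A minimal sketch of the same idea, using an unsigned char pointer rather than the union (either way, you get the in-memory bytes of each wchar_t, which depend on the platform's wchar_t size and byte order, not the locale-encoded bytes of the original question):

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Append the raw bytes of each wchar_t in ws to buf as hex pairs. */
void dump_wstr(const wchar_t *ws, char *buf)
{
    buf[0] = '\0';
    for (; *ws; ws++) {
        /* Inspect the object representation byte by byte. */
        const unsigned char *p = (const unsigned char *)ws;
        for (size_t i = 0; i < sizeof(wchar_t); i++)
            sprintf(buf + strlen(buf), "%02X ", p[i]);
    }
}
```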

-Doug Ewell
 Fullerton, California




Re: converting Unicode text into Unicode codes

2001-10-24 Thread DougEwell2

Nobody seems to have touched this one yet...

On 2001-10-22 at 15:35, Vadim Khaskel <[EMAIL PROTECTED]> wrote:

> I have question regarding tools available to convert Unicode 
> text into Unicode codes. We work on enhancement of our current product 
> and one of the new features is "Internationalization". Please let me 
> know if you may heard of such a tool. 

As Addison Phillips says in his signature block, "Internationalization is an 
architecture. It is not a feature."

You should clarify what you mean by "convert Unicode text into Unicode 
codes."  All computerized text, in Unicode or any other character set, is 
represented as a sequence of codes.  If the text is already "Unicode text," 
then by definition it is already encoded in "Unicode codes."

If you have text in another encoding, such as Latin-1 or Windows CP1252 or 
EBCDIC or whatever, and wish to convert it to Unicode, there is a handy tool 
called "recode" available as free software on the Internet.

If you already have Unicode text and wish to view the Unicode scalar values 
of the text (e.g. you want to display "Hi" as "U+0048 U+0069"), somebody 
could probably whip up a quick Perl script to do this.
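In place of that quick Perl script, a C sketch of the same idea (assuming the text is already in a wide string whose values are Unicode scalar values):

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Write the Unicode scalar values of a wide string as "U+XXXX " into buf. */
void show_scalars(const wchar_t *s, char *buf)
{
    buf[0] = '\0';
    for (; *s; s++)
        sprintf(buf + strlen(buf), "U+%04lX ", (unsigned long)*s);
}
```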

But I think you need to explain more clearly what it is you have and what you 
want.

-Doug Ewell
 Fullerton, California




Re: [idn] An ignorant question about TC<-> SC

2001-10-23 Thread DougEwell2

In a message dated 2001-10-23 21:38:30 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  ...
>  But, if Unicode incorporate new rule to NFC, it may affect the usage
>  of chinese characters in any of those countries who are sharing CJK
>  area.

If the goal is to get Unicode/10646 to endorse a mapping scheme between 
Traditional Chinese and Simplified Chinese, whether as part of a 
normalization form or otherwise, then that discussion belongs on the Unicode 
mailing list (to which I have cross-posted this message).

The banner for this cause needs to be carried by someone other than me, 
someone who (a) thinks it is a good idea and (b) has the technical knowledge 
about CJK to support it.

-Doug Ewell
 Fullerton, California

> On Tue, Oct 23, 2001 at 11:43:29PM -0400, [EMAIL PROTECTED] wrote:
>> In a message dated 2001-10-23 11:13:14 Pacific Daylight Time, 
[EMAIL PROTECTED] 
>> writes:
>>
>>>  On the other hand, one problem is more severe
>>>  than in the Chinese case: in the general case, a Serbo-Croatian
>>>  string written in Cyrillic cannot be distinguished, on a
>>>  character string basis, from uses of Cyrillic for other languages
>>>  (e.g., Russian), which should not be mapped and, similarly, a
>>>  string written in Roman-based characters cannot be distinguished,
>>>  on a character string basis, from the Roman-based characters of
>>>  another language (English?) which, again, cannot be mapped.
>>
>> But this problem *does* exist in the Chinese case, because certain Han 
>> characters can also be used to write Japanese or (I've been told) Korean.  
In 
>> a Japanese or Korean context, it wouldn't make any sense to map the 
correct 
>> "traditional" Han character to a simplified "equivalent"; the simplified 
>> character is only equivalent if the language is Chinese.
>
>  Dear Doug Ewell,
>
>  Even though your statement is not wrong, more clarification is needed.
>  There are two types of simplified chinese characters. One type is 
>  traditional (oops) one and the other type is relatively new one.
>
>  First type of traditional simplified chinese characters was invented
>  over long period of time among four countries (China, Korea, Japan, 
Taiwan).
>  Some of them are common to all for countries, some of them are only
>  used in one of those countries. (Let's call it type I)
>
>  Second type of relatively new simplified chinese characters was invented
>  around 50 years ago (I am not sure) by the People Republic of China
>  government. (Let's call it type II)
>
>  In Korea, we do not use type II. For type I, even though we do not
>  decided any policy, but we may easily prevent disputes by registration
>  policy.
>
>  But, if Unicode incorporate new rule to NFC, it may affect the usage
>  of chinese characters in any of those countries who are sharing CJK
>  area.




Re: [idn] REORDERING: stability issues and UTC solutions

2001-10-21 Thread DougEwell2

In a message dated 2001-10-21 18:27:30 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

> > Is this scheme of reordering for the sake of compression really UTC's
>  concern?
>
>  Actually, they might. See UTS#6 A Standard Compression.

I know a little about SCSU, having written the SCSU encoder and decoder used 
in the SC UniPad text editor, distributed by Sharmahd Computing (visit 
<http://www.unipad.org> for more information).

SCSU works by defining 128-character "windows" into the Unicode code space 
and using two slightly different mechanisms ("locking" and "non-locking") to 
refer to code points in those windows.  SCSU is intentionally a very 
lightweight solution, and does not depend on large tables of the sort 
required by the normalization forms or the proposed reordering scheme.  In 
particular, no reordering takes place in SCSU; code point order within each 
128-character window is strictly adhered to.
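The window mechanics amount to simple arithmetic.  A sketch, assuming the initial state from UTS #6 in which dynamic window 0 sits at offset 0x0080 (the Latin-1 supplement):

```c
/* In single-byte mode, a code point inside the active 128-character
   window is emitted as a single byte in the range 0x80..0xFF. */
int in_window(unsigned long cp, unsigned long offset)
{
    return cp >= offset && cp < offset + 0x80;
}

unsigned char window_byte(unsigned long cp, unsigned long offset)
{
    return (unsigned char)(0x80 + (cp - offset));  /* code point order preserved */
}
```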

Other forms of compression are unlikely to be of much interest to the UTC, 
which acknowledges that heavier-weight or application-specific compression 
schemes belong to the expertise of compression specialists or application 
vendors.

>  But of course, this discussion is not for IDN WG.

This is true, so I will not pursue it further on the IDN list.

Apologies to the Unicode list if this thread was inappropriate for 
cross-posting.

-Doug Ewell
 Fullerton, California




Re: [idn] REORDERING: stability issues and UTC solutions

2001-10-21 Thread DougEwell2

In a message dated 2001-10-21 7:22:07 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:

>  My  question was that:
>1) newly-approved TAGALOG characters X,Y  have   NFC X -> Y,

This is simply not going to happen.  Unicode/10646 has promised us they are 
not going to add more compatibility characters, though, believe me, they have 
had plenty of requests.

>  Future TAGALOG may provides two sets of TAGALOG basic alphabets.
>  One set A in official lexicographical ordering and the other set B 
>  is in frequecy ordering (sub-optimal one OKAY) with 1:1 NFKC defined 
>  from A onto B.

Are you actually suggesting that UTC encode each character twice?

>  IF UTC accepts REORDERING as an official normalization form like 
>  NF-REORDERING , then we need no such tricks like above, and
>  TAGALOG support can be done within that NF in the new 
>  NAMEPREP steps:  mapping/NFKC/PROHIBIT and then NF-REORDERING .

Is this scheme of reordering for the sake of compression really UTC's concern?

I would suggest that Soobok read not only UTR #15, but also the page called 
"Unicode Policies" on the Unicode web site.  There is a very clear 
description of some things Unicode (and by extension ISO/IEC JTC1/SC2/WG2) 
simply will not do.  It may be a real eye-opener for some.

-Doug Ewell
 Fullerton, California




Re: Windows/Office XP question

2001-10-18 Thread DougEwell2

In a message dated 2001-10-18 8:00:23 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  One feature that
>  some systems have is composite fonts, where the "font" is actually a table
>  of subfonts in some order (perhaps with specific ranges assigned to each).
>  That way, someone can have the advantage of specifying a single font name,
>  and get a full repertoire, without requiring a monster font. Of course,
>  there may be little uniformity of style across scripts, or in mixtures of
>  symbols, but at least you can get legible characters instead of boxes.
>
>  Are there any plans to do something like that in Windows?

In fact, just recently I noticed that Notepad in Windows 2000 substitutes 
glyphs from Microsoft Sans Serif whenever the "real" font chosen by the user 
does not support that character (and possibly even in some cases when it 
does).  I haven't noticed any other applications doing this.

-Doug Ewell
 Fullerton, California




Re: ZWJ+ZWNJ+ZWJ (Was: ZWJ and Turkish)

2001-10-12 Thread DougEwell2

In a message dated 2001-10-11 10:50:09 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:

> Why does Roozbeh think this is 'the worst thing in Unicode'?

Probably because it seems so ugly and hackish: 3 formatting characters 
required to provide shaping information for 2 graphic characters.  "Turn 
joining on; no, wait, turn it off; no, no, turn it on again."  I'm not saying 
there's a better way, though.

-Doug Ewell
 Fullerton, California




Re: [OT] ANN: Site about scripts

2001-10-11 Thread DougEwell2

Yes, of course that is what I meant.  Sorry if anyone was confused.

In a message dated 2001-10-11 8:55:22 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:

> Doug Ewell wrote...
>
>  > Cyrillic was created as a better way to write Slavic languages, Russian 
in 
>  > particular.  Shavian and Deseret were created as better ways to write
>  > English.  The former met with overwhelming success, the latter did not
>
>  It's usual to bind "former" and "latter" to the closest preceeding pair of 
 
>  items, which led me to think you were talking of Shavian versus Deseret,  
>  and that was moderately confusing.  Shavian certainly can't be counted an  
>  overwhelming success when compared to Deseret...
>
>  In case anyone else is boggled... "Former" actually refers to Cyrillic and 
 
>  "latter" to Shavian and Deseret together.

-Doug Ewell
 Fullerton, California




Re: [OT] ANN: Site about scripts

2001-10-11 Thread DougEwell2

In a message dated 2001-10-10 9:16:17 Pacific Daylight Time, 
[EMAIL PROTECTED] writes to [EMAIL PROTECTED]:

>  You may consider trying to classify the artificial scripts a bit more.
>  For example I *think* (I'm a bit rusty on my Elvish) that for Tengwar would
>  be either Abjad (like Hebrew), or maybe Featural (like Hangul), and Cirth
>  would be Alphabet (like Runic).

Alternatively, you may consider moving the "artificial" classification to a 
subcategory inside the main categories (alphabet, abjad, syllabary, etc.), or 
even doing away with the distinction altogether.  *All* scripts are man-made 
and thus "artificial" in a sense.  The only exception to this would be if you 
subscribe to some article of religious faith that says a particular script 
was furnished directly by God.

Cyrillic was created as a better way to write Slavic languages, Russian in 
particular.  Shavian and Deseret were created as better ways to write 
English.  The former met with overwhelming success, the latter did not (to 
say the least), but the success of Cyrillic does not make the circumstances 
of its creation any less "artificial."

Even Tengwar and Cirth, although created to support languages introduced in 
works of fiction, were invented according to some of the same principles that 
guide the creation of so-called "real" scripts.

-Doug Ewell
 Fullerton, California




Re: Code points for "al-Qaeda"

2001-10-03 Thread DougEwell2

In a message dated 2001-10-03 10:13:26 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

> Stefan Persson wrote:
>
>> Do you write "al-Qaeda" and "Osama bin Laden" in English? In Swedish
>> newspapers it's written "al-Qaida" and "Usama bin Ladin."
>
>  Usually.  The U.S. media seem to prefer "Taliban" universally,
>  but I have seen "Taleban" in publications from other anglophone
>  countries.  All these e's and o's in Arabic don't actually exist
>  in the writing system, and are imported from the colloquial.

CNN does use the spellings I gave (which Stefan asked about), although 
frequently they drop the hyphen from "al-Qaeda."  John is correct about 
"Taliban"; the BBC spells it "Taleban."

All these romanization discrepancies are exactly the reason I asked about the 
"true" Arabic spelling in Unicode in the first place.  Thanks to everyone for 
their (on-topic) contributions.

-Doug Ewell
 Fullerton, California




Deseret keyboard (was:Re: Special Type Sorts Tray 2001)

2001-10-02 Thread DougEwell2

In a message dated 2001-10-02 22:04:41 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>> I still live in hopes that someone, John or someone else, will one
>> day send me a Deseret keyboard layout that is at least SLIGHTLY
>> standard (meaning more than one person has ever used it).
>>
>> I need something I can download and read on a Windows machine.
>> Text or a GIF would be fine.
>
>  I have a hard time picturing what such a layout would *look* like... what
>  the heck would someone who uses the language expect, anyway? :-)

Well, careful now.  The language is English.  You mean "someone who uses the 
script."

I tried creating a Deseret keyboard for (and with) SC UniPad, using the 
Dvorak keyboard layout as a loose model.  By that I do not at all mean that I 
mapped Latin letters on the Dvorak keyboard to "equivalent" Deseret letters, 
but rather that I put the most common letters (as determined from a large 
chunk of text in Deseret) on the home row and relegated the least common 
letters to Alt+Gr (Ctrl+Alt) combinations.  The biggest problem, of course, 
is that there are 38 of the buggers and so these Alt+Gr combinations are 
necessary.

My keyboard is all right, I guess, but it is completely my own invention and 
I really know nothing about the engineering that goes into proper keyboard 
design.  I'd feel better with something designed by someone who had a clue, 
and/or something that has seen some actual use.  Not that there are an awful 
lot of users, mind you.

-Doug Ewell
 Fullerton, California




Code points for "al-Qaeda"

2001-10-02 Thread DougEwell2

Like everyone else, I have suddenly become familiar in the past three weeks 
with the name "al-Qaeda," Arabic for "the base" and the name of Osama bin 
Laden's terror network.

I have also noticed the variations in pronunciation and romanized spelling, 
and being a bit more interested in such things than the typical American, it 
makes me curious:  How is "al-Qaeda" spelled in Arabic?

I know there are several list members who know the small amount of Arabic 
necessary to answer this question.  Please specify Unicode code points in the 
U+0600 block.

Thanks,

-Doug Ewell
 Fullerton, California




Re: Special Type Sorts Tray 2001

2001-10-02 Thread DougEwell2

In a message dated 2001-10-02 10:46:47 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>> And I am sure Apple is hard at work on the Desert font and keyboard for Mac
>> OS 11? :-)
>
>  We've already added a Deseret glyph to the Last Resort font in 10.1. 
>  Beyond that, the Deseret Language Kit remains available at my Web 
>  site but doesn't work on X.  Yet.

I still live in hopes that someone, John or someone else, will one day send 
me a Deseret keyboard layout that is at least SLIGHTLY standard (meaning more 
than one person has ever used it).

I need something I can download and read on a Windows machine.  Text or a GIF 
would be fine.

I noticed that the LDS Church is listed as an associate member of Unicode.  I 
wonder if their representative might have anything.

-Doug Ewell
 Fullerton, California




Re: Special Type Sorts Tray 2001 (derives from Egyptian Transliteration Chara...

2001-10-02 Thread DougEwell2

Oops, I forgot something.

In a message dated 2001-10-02 4:50:03 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  if such a refusal has been declared then people who wish
>  to have these characters encoded may act knowing that the Unicode 
Consortium
>  will have legally estopped itself from making any future complaint that it
>  has some right to set the standards in such a matter 

The Unicode Consortium is a private, not-for-profit organization.  ISO/IEC 
JTC1/SC2/WG2 is an international standards working group.  I don't believe 
either is subject to the legal principle of estoppel.  Essentially, if they 
want to they can play Calvinball with the standard they are creating, 
although we all hope that does not happen.

-Doug Ewell
 Fullerton, California

("Calvinball" comes from the American comic strip "Calvin and Hobbes," in 
which a young boy plays a game with his stuffed tiger who comes to life, the 
main rule of which game is that the boy, Calvin, can change the rules at any 
time.)




Re: Special Type Sorts Tray 2001

2001-10-02 Thread DougEwell2

In a message dated 2001-10-02 4:50:03 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  Is there an official Unicode Consortium statement that states, for the
>  record, that the Unicode Consortium refuses to encode more ligatures and
>  precomposed characters please?

I'm pretty sure there is, since it has been brought up so often by UTC 
members on this list.  If there is no such statement, then one should be 
drafted.

>  I feel that this is a matter that needs to be formally resolved one way or
>  the other, so that, if such a refusal has been declared then people who 
wish
>  to have these characters encoded may act knowing that the Unicode 
Consortium
>  will have legally estopped itself from making any future complaint that it
>  has some right to set the standards in such a matter and that those people
>  who would like to see the problem solved and ligatured characters encoded 
as
>  single characters so that a font can be produced may proceed accordingly,
>  perhaps approaching the international standards body directly if the 
Unicode
>  Consortium refuses to do so without a process of even considering 
individual
>  submissions on their individual merits.  On the other hand, if no such
>  formal statement has been issued, then those people who would like to see
>  the problem solved and ligatured characters encoded as single characters so
>  that a font can be produced for use with software such as Microsoft Word 
may
>  proceed to define characters in the private use area in a manner compatible
>  with their possible promotion to being regular unicode characters in the
>  presentation forms section.

Was that only two sentences?  Wow.

Regarding the "refusal" to encode more ligatures and precomposed presentation 
forms: It is not arbitrary.  There is a reason why Unicode will not encode 
these things.  They would interfere with the established standard for 
decomposition.  Now that Unicode has reached its present level of popularity, 
some vendors and implementations (and standards) require a stable set of 
decomposable code points.  That set is Unicode 3.0.  If new precomposed 
characters were added, engines and standards that were built to the new 
standard would decompose them differently from those built to the old 
standard, and this is not acceptable to those who need decomposition to work 
at all.

Precomposed characters and ligatures won't be considered "on their individual 
merits," and they won't be "promoted" from a private standard to true Unicode 
character status, because the decomposition problem is bigger than the 
individual merits.  Note that I personally like the ct ligature and think it 
would be a great thing to have in a font.  If this were 1993, perhaps it 
might have been encoded.

Regarding fonts: Nothing is stopping you or anyone else from making a font 
with these precomposed glyphs and associating them with Unicode PUA (Private 
Use Area) code points.  That is an excellent illustration of a possible use 
of the PUA, and many, many font vendors do just that.  

>  I feel that it would be quite wrong to pull up the ladder on the 
possibility
>  of adding characters such as the ct ligature as U+FB07 without the
>  possibility of consideration of each case on its merits at the time that a
>  possibility arises.  A situation would then exist that several ligatures
>  have been defined as U+FB00 through to U+FB06 including one long s 
ligature,
>  yet that U+FB07 through to U+FB12 must remain unused even though they could
>  be quite reasonably used for ct and various long s ligatures so as to
>  produce a set of characters that could be used, if desired, for 
transcribing
>  the typography of an 18th Century printed book.  Yet, if the ladder has 
been
>  pulled up, perhaps U+FB07 can be defined as the ct ligature directly by the
>  international standards organization and the international standards
>  organization could decide directly about including the long s ligatures.

The organization you are talking about is ISO/IEC JTC1/SC2/WG2.  They are 
firmly committed to maintaining compatibility between Unicode and ISO/IEC 
10646.  Sorry, but this is a good thing.

>  If the possibility of fair consideration is, however, still open, then the
>  ct ligature could be defined as U+E707 within the private use area and
>  published as part of an independent private initiative amongst those 
members
>  of the unicode user community that would like to be able to use that
>  character in a document by the character being encoded as a character in an
>  ordinary font file.  That would enable font makers to add in the ct
>  character if they so choose.

You might start by checking existing fonts, especially those shipped with 
major operating systems, to see what PUA code points are commonly used 
internally for glyphs not associated with a standard Unicode character.  I 
know that several Windows fonts have privately assigned glyphs, and I assume 
the same is true for Macintosh.

Re: Special Type Sorts Tray 2001

2001-09-30 Thread DougEwell2

In a message dated 2001-09-30 9:19:31 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  I have been thinking recently that it would be useful to have presentation
>  forms for a ct ligature character and various long s ligatures so that one
>  may transcribe printed works from the 18th century into unicode while
>  keeping the typographic style intact.

As mentioned, this can already be done with ZWJ, although fonts may not be 
able to render it correctly.  (But this is always true for any newly added 
glyph, no matter how encoded.)

>  In view of these various situations and possibly various others that people
>  might like to post into this thread, I write to put forward the suggestion
>  that as a discussion on this list various users of the unicode
>  specification might like to agree informally a collection of characters
>  called Special Type Sorts Tray 2001 or STST2001 to be defined in the 
Private
>  Use Area in, say, the range U+E700 through to U+E7FF in the hope that
>  perhaps by there being some informal agreement perhaps someone with a font
>  generating package might like to add them into a font and maybe various
>  small yet significant benefits to the facilities available for encoding 
text
>  might be achieved.

You might want to take a look at the ConScript Unicode Registry, which was 
originally intended for "constructed" and artificial scripts, but which could 
also be used for this purpose.

>  Please know that I am specifically suggesting that this be a discussion
>  amongst the user community: I am not suggesting that the Unicode Consortium
>  endorse this suggestion as I am fully aware that the rules for the use of
>  the Private Use Area specifically say that no assignment to a particular 
set
>  of characters will ever be endorsed by the Unicode Consortium.

OK, then ConScript might be a suitable venue for this proposed encoding after 
all.

>  I declare an interest in the choice of U+E700 to U+E7FF as the range for
>  STST2001 in that I have been defining and publishing,

This range is already taken in ConScript, but several other ranges are 
available, and as David mentioned, you'll probably need a lot more than 256 
code points.

ConScript is the work of Michael Everson and John Cowan.  You should check 
with them.

http://www.evertype.com/standards/csur/index.html
http://www.evertype.com/standards/csur/conscript-table.html

-Doug Ewell
 Fullerton, California




Re: Shape of the US Dollar Sign

2001-09-28 Thread DougEwell2

In a message dated 2001-09-28 10:30:12 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>> Here's Arnold's chance to ask everyone to send him samples.
>  
> And many of them too, until he gets some with a dollar sign on it. None of
> the banknotes I have in my wallet ($1, $10, and $20) show a dollar sign on
> them! Or It's hidden somewhere, in which case, where?

None of the modern "small-size" notes (issued since 1929) show a dollar sign, 
except for the $100,000 gold certificate which was only issued for inter-bank 
use.  Many of the large-size notes (issued before 1929) did show a dollar 
sign.

-Doug Ewell
 Fullerton, California




Re: Egyptian Transliteration Characters

2001-09-26 Thread DougEwell2

In a message dated 2001-09-26 8:09:18 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>> The problem is, I have a couple of German texts that I plan to
>> transcribe, where all I need is HYPHEN WITH DIARESIS.
>
> So, you type HYPHEN or EN DASH and then COMBINING DIAERESIS ABOVE.

I think that was David's point, that these things are always possible using 
combining characters, and the argument "but it's easier with a precomposed 
character" doesn't stand up to the concerns about proliferation and 
normalization.

-Doug Ewell
 Fullerton, California




Re: GB18030

2001-09-24 Thread DougEwell2

In a message dated 2001-09-24 20:50:25 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>> Does GB18030 DEFINED the mapping between GB18030 and the rest of 11 planes?
>> I don't think so, since Unicode have not define them yet, right ?
>
> Unicode defined all the planes, a long long time ago. It's added
> characters for 3 of them - Plane 1 (basically the overflow area for the
> non-CJK part of the BMP), Plane 2 (more ideographs) and Plane 14
> (special tag characters).

David's absolutely right.  This is another common misconception: that Unicode 
does not "define" the code space unless characters are actually assigned to 
all of its code points.

This kind of thinking led, in part, to all the complacency on the part of 
database vendors and others concerning the need to support surrogate code 
points.  They thought that just because no characters had YET been assigned 
to non-BMP code points, they could safely ignore the whole issue of surrogate 
processing.  Then, when non-BMP characters became a reality, we began to see 
kludges like CESU-8.
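The surrogate processing that was being ignored amounts to a few lines of arithmetic. A sketch (not any vendor's actual code):

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16 lead/trail surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

def from_surrogates(lead, trail):
    """Recombine a lead/trail surrogate pair into the original code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
```

For example, U+20000 (a Plane 2 ideograph) becomes the pair D840 DC00; any system that stops at "no characters are assigned there yet" has simply deferred writing these two functions.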

-Doug Ewell
 Fullerton, California




Re: UTF-8 <> UCS-2/UTF-16 conversion for library use

2001-09-24 Thread DougEwell2

In a message dated 2001-09-24 11:16:29 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:
 
> For many applications UTF-16 is a good compromise between a large code size
> and processing efficiency.  As this industry changes the decision points
> change.  Then there is always the great argument that many applications that
> were written for UCS-2 are much easier to convert to UTF-16.

This last argument is especially compelling when you consider the large 
number of people (programmers and others) who still think "Unicode" means 
"16-bit" and whose entire concept of supporting Unicode is using WCHARs and 
letting library string functions do the conversions automagically.

-Doug Ewell
 Fullerton, California




[OT] Roman numeral arithmetic (was: Re: [lojban] (from lojban-beginners) pi'e)

2001-09-22 Thread DougEwell2

In a message dated 2001-09-22 11:35:16 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>> I would be fascinated to see some sort of evidence that addition and 
>> subtraction is easier in Roman numerals than in Hindu-Arabic ("European") 
>> numerals.
>
>  I + I = II
>  X + X = XX
>  X + X + X = XXX
>  C + X = CX
>  CX - X = C

For these carefully chosen examples, sure, but what about:

III + IX = XII
XXIV + XXVII = LI
C - I = XCIX

etc.  This is no better than European digits, and it feels a little like 
doing math with pounds, shillings, and pence.

In a message dated 2001-09-22 11:59:22 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

> Addition and subtraction definitely isn't easier in Roman numerals than in 
> Hindu-Arabic numerals. As I understood Edward's words, addition and 
> subtraction is easier *than multiplication and division* in Roman numerals.

Edward's words were originally as follows (whole paragraph this time, to 
avoid quoting out of context):

> We generally believe that the mathematicians led by Leonardo Fibonacci won
> out over the Old Guard in replacing Roman numerals with Hindu-Arabic
> numerals, but the victory was long drawn out, and is still incomplete.
> Businesses continued to use Roman numerals for several centuries (because
> addition and subtraction is easier in Roman numerals, and they didn't have
> that much call for multiplication and division until interest became
> socially acceptable). In Fibonacci's time, and up until the 17th century,
> clocks existed almost exclusively in churches and monasteries, where they
> regulated the hours of prayer. The Catholic church was having nothing to do
> with these new-fangled heathen numbers, especially since the Crusades were
> on at the time, and continued for centuries. The Church was especially
> opposed to the idea of zero, both as the work of the infidel, and on
> Aristotelian grounds. That is why the twelve-hour system starts at XII,
> followed by I, and not 0:00:00.

Reading it again, I still wouldn't have interpreted the passage about 
"addition and subtraction is easier..." to mean "... than multiplication or 
division," but obviously this is true (also for European numerals, but less 
so than for Roman numerals).

In any case I am scratching my head as to how this ended up on the Unicode 
list, unless it was somehow seen as relevant to the earlier discussion about 
sorting text that contains both European and Arabic-Indic digits.

-Doug Ewell
 Fullerton, California




Re: [lojban] (from lojban-beginners) pi'e

2001-09-22 Thread DougEwell2

In a message dated 2001-09-22 0:02:39 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

> Businesses continued to use Roman numerals for several centuries (because
> addition and subtraction is easier in Roman numerals,

I would be fascinated to see some sort of evidence that addition and 
subtraction is easier in Roman numerals than in Hindu-Arabic ("European") 
numerals.

-Doug Ewell
 Fullerton, California




Re: PDUTR #26 posted

2001-09-18 Thread DougEwell2

In a message dated 2001-09-18 9:22:17 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  Doug Ewell wrote:
>> All Unicode code points of the form U+xxFE and U+xxFF are special, in 
>> that they are non-characters and can be treated in a special way by 
>> applications (e.g. as sentinels).
>
>  I think this should be "All Unicode code points of the form U+xxFFFE and
>  U+xxFFFF are special [...]", otherwise also legal codes such as U+00FF ("ÿ")
>  would be included in the statement.

Oops!  One of two "Unicode 101" mistakes I made in the same day.  Where was 
my brain?

-Doug Ewell
 Fullerton, California




Re: PDUTR #26 posted

2001-09-18 Thread DougEwell2

David Hopwood and Carl Brown graciously corrected me:

>> I don't agree that irregular UTF-8 sequences in general can only decode to
>> characters above 0xFFFF.
>
>  That's why I specifically referred to irregular sequences as defined by
>  Unicode 3.1 (i.e. UAX #27).

I stand corrected.  That's what I get for not having a copy of UAX #27 handy.

Non-shortest sequences, of course, used to be considered irregular (not 
invalid) in Unicode 3.0, before the Technical Committee wisely tightened up 
the definition of UTF-8.

-Doug Ewell
 Fullerton, California




Re: PDUTR #26 posted

2001-09-17 Thread DougEwell2

In a message dated 2001-09-17 16:24:05 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

> It doesn't reopen that specific type of security hole, because irregular
> UTF-8 sequences (as defined by Unicode 3.1) can only decode to characters
> above 0xFFFF, and those characters are unlikely to be "special" for any
> application protocol. However, I entirely agree that it's desirable that
> UTF-8 should only allow shortest form; 6-byte surrogate encodings have
> always been incorrect.

All Unicode code points of the form U+xxFE and U+xxFF are special, in 
that they are non-characters and can be treated in a special way by 
applications (e.g. as sentinels).

I don't agree that irregular UTF-8 sequences in general can only decode to 
characters above 0xFFFF.  For example, the following irregular UTF-8 
sequences all decode to U+0000:

C0 80
E0 80 80
F0 80 80 80
F8 80 80 80 80
FC 80 80 80 80 80
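A strict decoder must reject every one of these. A sketch using Python, whose built-in codec happens to implement the tightened shortest-form rule (the 5- and 6-byte lead bytes F8 and FC are rejected as invalid start bytes before sequence length is even considered):

```python
# Each of these overlong sequences would decode to U+0000 if a decoder
# failed to insist on shortest-form UTF-8 -- the classic security hole.
overlong_nul = [b"\xC0\x80", b"\xE0\x80\x80", b"\xF0\x80\x80\x80"]

for seq in overlong_nul:
    try:
        seq.decode("utf-8")
        rejected = False
    except UnicodeDecodeError:
        rejected = True
    assert rejected, "a lenient decoder is a security hole"

# The one legitimate encoding of U+0000 is the single byte 00.
assert b"\x00".decode("utf-8") == "\u0000"
```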

It is true that the *specific* irregular UTF-8 sequences introduced (and 
required) by CESU-8 decode to characters above 0xFFFF when interpreted as 
CESU-8, and to pairs of surrogate code points when (incorrectly) interpreted 
as UTF-8.  Since definition D29, arguably my least favorite part of Unicode, 
requires that all UTFs (including UTF-8) be able to represent unpaired 
surrogates, the character count for the same chunk of data could be different 
depending on whether it is interpreted as CESU-8 or UTF-8.  That's a 
potential security hole.
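To make the difference concrete, here is a sketch of the CESU-8 scheme as I read the PDUTR (illustrative only, not production code):

```python
def to_cesu8(text):
    """Encode text the CESU-8 way: convert to UTF-16 code units first,
    then write every unit -- surrogates included -- as a 1- to 3-byte
    UTF-8-style sequence.  Supplementary characters thus come out as
    six bytes instead of UTF-8's four."""
    raw = text.encode("utf-16-be")
    units = [int.from_bytes(raw[i:i + 2], "big")
             for i in range(0, len(raw), 2)]
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)
```

For BMP text the output is byte-for-byte identical to UTF-8; a supplementary character like U+10400 becomes the six bytes ED A0 81 ED B0 80 instead of UTF-8's four bytes F0 90 90 80, which is exactly where the differing character counts come from.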

CESU-8 decoders that are really diligent could check for this, of course, but 
when I think of CESU-8 the concept of "really diligent decoders" just doesn't 
spring to mind.  If the inventors were really diligent, they would have 
implemented UTF-16 sorting correctly in the first place.

-Doug Ewell
 Fullerton, California




Re: CESU-8: to document or not

2001-09-17 Thread DougEwell2

In a message dated 2001-09-17 13:06:16 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

> I agree that there is a world of software out there that does not support
> Unicode 3.1 yet.  Toby has a legitimate problem.  It is the proposed 
> solution
> that bothers me.  For now I suspect that living with the BMP restrictions
> should not pose a severe hardship on most systems today.  Moving on to fully
> implement the full Unicode range should be the carrot to upgrade current 
> code.
> PDUTR #26 is the wrong way to go because it puts a new demand on systems
> that have already converted.  It also creates more work by doing things 
> twice.
> Adding proper library of CESU-8 support functions is probably more work
> than upgrading from UCS-2 to UTF-16.

To reiterate my earlier point:  Restricting potential CESU-8 implementations 
to BMP characters only (i.e. UCS-2) should not be a significant limitation, 
since there are almost certainly no supplementary characters in Oracle or 
Peoplesoft databases.

Can Jianping, Toby, or someone else from Oracle or Peoplesoft please address 
this question of supplementary characters?  The whole purpose of CESU-8 is to 
handle supplementary characters differently from the way UTF-8 handles them.  
What supplementary characters are being currently handled in a CESU-8 way 
that must not be corrected?

-Doug Ewell
 Fullerton, California




Re: PDUTR #26 posted

2001-09-17 Thread DougEwell2

In a message dated 2001-09-17 4:25:47 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>> How should an UTF-8 application behave if it accidentally receives
>> a CESU-8 surrogate sequence?  How does an application which
>> relies on CESU-8 binary sorting behave if it accidentally receives an
>> UTF-8 4-byte sequence?
>
> Both should error out. In practice, I wonder how common it would be and
> because of this how many people will actually do THAT in their parsers. I
> expect lots of non-compliant parsers.

If Michka is referring to non-compliant CESU-8 parsers, I really wouldn't 
care much because CESU-8 is supposed to live in its own little private world. 
 But if people start compromising their UTF-8 parsers to accommodate CESU-8 
"adaptively," it would be a great blow to UTF-8.  It would essentially undo 
all the tightening-up that was accomplished by the Corrigendum, and it would 
revive all the old Bruce Schneier-style skepticism about the "security" of 
Unicode.

-Doug Ewell
 Fullerton, California




Re: CESU-8 vs UTF-8

2001-09-16 Thread DougEwell2

In a message dated 2001-09-16 13:13:38 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  I think that some applications will find it easier to migrate to
>  UTF-32 rather than convert to UTF-16.

I know I have.  Handle everything internally as UTF-32, then read and write 
UTF-8 or UTF-16 as appropriate.

>  CESU-8 breaks that model becasue it is a form of Unicode with the sole
>  purpose of supporting a non-Unicode code point order sort order.  Yes I
>  could devise a way to sort UTF-32 and UTF-8 in UTF-16 binary sort order but
>  that is only a matter of some messy code.  The real issue is that I must 
>  now
>  handle Unicode that has as part of it essential property that it must
>  survive transforms with two distinctly different sort orders.

I was glad when Unicode began moving away from the doctrine of "treat all 
characters as 16-bit code units" and toward "treat them as abstract code 
points in the range 0..0x10FFFF."  Make no mistake, UTF-16 can be a useful 
16-bit transformation format; but it should not be considered the essence of 
Unicode, especially not to the point where additional machinery needs to be 
built on top of the Unicode standard solely to support UTF-16.

>  With this standard approved my applications can be compelled to use CESU-8
>  in place of UTF-8 if I was to talk to Peoplesoft or other packages that 
>  will
>  insist on this sort order of Unicode data.  If I use UTF-8 as well, then I
>  will need two completely different sets of support routines.

Actually, what you will need is *one* routine that works with both UTF-8 and 
CESU-8, but breaks the definition of both in doing so, by permitting either 
method of handling supplementary characters, and auto-detecting the data as 
UTF-8 or CESU-8 based on the method encountered.
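Such an auto-detecting routine would presumably hinge on a check like this (a sketch of the idea, emphatically not an endorsement):

```python
def looks_like_cesu8(data):
    """Heuristic: CESU-8 carries supplementary characters as two 3-byte
    surrogate sequences (ED A0..AF xx, then ED B0..BF xx), while
    well-formed UTF-8 never contains an encoded surrogate at all."""
    return any(data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xAF
               for i in range(len(data) - 1))
```

Note what accepting such input costs: the moment a "UTF-8" parser tolerates ED A0..AF lead sequences, it is no longer a conformant UTF-8 parser.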

>  My problem is that the correct approach is for people like Peoplesoft to 
>  fix
>  their code before accepting non BMP characters.

Still unanswered, in this proposal to sanctify hitherto non-standard 
representations of non-BMP characters in commercial databases, is the 
question of how much non-BMP data even exists in commercial databases in the 
first place.  I know I personally have some (and will soon have more, now 
that SC UniPad supports Deseret), but what about users of Oracle and 
Peoplesoft databases?  Other than the private-use planes, it was not even 
allowable to use non-BMP characters until the release of Unicode 3.1 earlier 
this year.  Where is the great need for a compatibility encoding?

-Doug Ewell
 Fullerton, California




Re: CESU-8 vs UTF-8

2001-09-15 Thread DougEwell2

"Carl W. Brown" <[EMAIL PROTECTED]> writes:

> This is not a true statement.  "It is not intended nor recommended as an
> encoding used for open information exchange." is false.  Its intent is to
> layout a format encoding between Oracle and Peoplesoft code in the hopes
> that they can get other database vendors to support it.  They are really
> asking for a public standard not a private implementation.
>
> If it were only an internal protocol used internally by a single
> vendor they
> would not be submitting a UTR.

Exactly.  If CESU-8 were intended only as an internal representation, it 
would not matter whether it had any official recognition or blessing from 
Unicode.  I can store Unicode data internally any way I want, using UTF-17 
[1] if I choose, and there is nothing non-conformant about this as long as I 
treat the data as scalar values and can convert to the real UTFs for data 
exchange purposes.  To propose CESU-8 in a Technical Report is, as Carl said, 
an attempt to make it an official, public standard.

> 1) Why ask for an IANA character set designation for "internal use within
> systems processing Unicode"?  This is a definite indication that the real
> intent goes well beyond even the multi-vendor application to data base
> interfaces.  It is apparent that the real intent is the use the force of
> standards not only to compel the major database developers to offer support
> for CESU-8 but to make it a public internet standard as well.

This section of the TR amazed me.  In the Summary and elsewhere, CESU-8 "is 
not intended nor recommended as an encoding used for open information 
exchange," but by the end of the document we learn that it will be registered 
with the Internet Assigned Numbers Authority.  I have spelled out IANA for a 
reason, to highlight that it is a body dealing with open information exchange 
over the Internet.  This completely refutes all of the "internal use only" 
claims made in the rest of the document.

> Is there an alternative?  Yes.  You must use special code to
> compare UTF-16.
> If you use the OLD UCS-2 code it will give you the unique UTF-16 compare
> problem.  However by adding two instructions to the compare that add very
> little overhead, you can provide a Unicode code point compare routine that
> sorts in exactly the same order as UTF-32 & UTF-8.

This was my solution long ago: fix the code that sorts in UCS-2 order so that 
supplementary characters are sorted correctly.  In case there is any 
disagreement about this, sorting by UCS-2 order has been WRONG ever since 
surrogates and UTF-16 were invented.
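Carl's two-instruction fix is presumably a per-unit adjustment along these lines (my sketch of the well-known trick, not his actual code):

```python
def fixup(u):
    """Adjust one UTF-16 code unit so that plain integer comparison of
    the adjusted values yields code point order (the same binary order
    as UTF-8 and UTF-32)."""
    if u >= 0xE000:        # BMP units above the surrogate block...
        return u - 0x800   # ...slide down to fill the surrogate gap
    if u >= 0xD800:        # surrogate units stand for U+10000 and up...
        return u + 0x2000  # ...so they move above every other BMP unit
    return u

def utf16_codepoint_compare(a, b):
    """Compare two UTF-16 code unit sequences in code point order."""
    for x, y in zip(a, b):
        fx, fy = fixup(x), fixup(y)
        if fx != fy:
            return -1 if fx < fy else 1
    return (len(a) > len(b)) - (len(a) < len(b))
```

The cost really is tiny: two compares and an add per code unit, versus inventing and deploying an entire new encoding scheme.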

However, the database vendors' position is that there is now data sorted in 
this way, and it cannot be changed or database integrity will be compromised. 
 Fine, there is another alternative: sort all data in UCS-2 order, regardless 
of the encoding scheme.  This takes, as Carl said, about two lines of code.  
You don't lose any significant processing time, and you DON'T need to invent 
a new encoding scheme.

> 2) The time is now to add the specification of code point order compare
> support for systems, databases and libraries offering UTF-16 support before
> Unicode systems are split into two different migration paths for future
> multi plane character support and while vendors are upgrading from UCS-2 to
> UTF-16 support.

Unicode has, understandably, avoided recommending binary code point order, 
referring people instead to the Collation Algorithm for culturally correct 
sorting.  This is good because it alerts designers of most applications to 
the real issues surrounding collation.  For database applications, however, 
there is a need for binary code point order that has more to do with 
consistency than cultural correctness.  I accept this, but still contend that 
you can sort UTF-8 data in UCS-2 code point order quickly and easily, without 
the need for CESU-8 at all, let alone the need to enshrine it in a TR.

There was a lot that I liked in this PDUTR.  The misleading name "UTF-8S" has 
been replaced, and there are all those caveats that CESU-8 is not, not, NOT 
to be used in open data exchange.  None of these caveats, however, can be 
taken seriously as long as Section 4, "IANA Registration," is present.

I suggest, as part of the Proposed Draft stage for this document, that 
Section 4 be deleted and that IANA be informed that CESU-8 is intended as an 
internal encoding only and that they are explicitly requested NOT to register 
it.

-Doug Ewell
 Fullerton, California

[1]  UTF-17 was a *humorous* description of an exceedingly inefficient 
Unicode character encoding scheme.  It was not proposed seriously and does 
not contribute to the proliferation of UTFs.




Re: [OT] o-circumflex

2001-09-08 Thread DougEwell2

In a message dated 2001-09-08 12:00:43 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  I know the Real Academia Española decided to do away with "ch" and "ll" in
>  1994, but do you know if the other Spanish speaking countries' 
>  corresponding
>  academies done the same?

I have no idea.  I don't know which, if any, even have a language academy.

-Doug Ewell
 Fullerton, California




Re: [OT] o-circumflex

2001-09-07 Thread DougEwell2

In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  You are quite correct that is why Unicode support differing collation
>  strengths.  Some times you only care about the actual letters without
>  diacritics.  But even then letters are locale sensitive.  For example the
>  Danish alphabet starts with an A and ends it with A ring above.  A Dane
>  would look for Alborg near the end of a list of towns.  It is like having
>  the Spanish ch follow cz.

That would be Ålborg, right?

I hasten to add that Carl's Spanish example is for the so-called "traditional 
sort," in contrast to the "modern sort" in which "ch" sorts simply as "c" 
followed by "h".  In many Spanish-speaking communities, particularly here in 
Alta California, the simplified "modern" sort is by far the more common of 
the two.

-Doug Ewell
 Fullerton, California




Re: japanese xml

2001-09-03 Thread DougEwell2

In a message dated 2001-09-03 18:02:09 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>   [XML], however, provides little information on existing CESs already
>   in use for the interchange of Japanese characters. Such CESs are
>   allowed as mere options among many others. Furthermore, [XML] says
>   nothing about the appropriate CESs for each protocol (e.g. SMTP or
>   HTTP) and those for information exchange files.
>
>   The mapping between such existing CESs and [ISO/IEC10646]/[Unicode
>   3.0] is not specified either. Some mutually different conversions are
>   in use, and thus different XML processors may emit different outputs.

I didn't think it was the purpose of the W3C or the XML specification to 
define mapping tables between Unicode/10646 and other encodings.  XML itself 
supports the use of just about any ASCII- or EBCDIC-compatible encoding you 
like, as long as you declare it in the XML header.  Whether it gets 
interpreted correctly, or at all, is up to the XML processor.  Not every 
processor will necessarily understand every encoding.
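A sketch of that behavior, using Python's bundled expat parser, which understands only a handful of encodings and thus illustrates the point nicely:

```python
import xml.etree.ElementTree as ET

# The declared encoding tells the processor how to read the bytes;
# whether the processor actually knows that encoding is its own problem.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'
root = ET.fromstring(doc.encode("iso-8859-1"))
```

Here the processor happens to support ISO 8859-1, so `root.text` comes back as the intended "café"; declare an encoding it does not know and parsing simply fails.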

If there are two or more different mappings between Unicode/10646 and some 
other encoding -- say, JIS X0208 -- then different XML processors certainly 
may emit different outputs.  That is not XML's fault, and it is not Unicode's 
fault either.  Unicode provides mapping tables to a wide variety of 
encodings.  I would use those if it were up to me.

-Doug Ewell
 Fullerton, California



