Han Radical-Stroke Index

2002-05-13 Thread William Overington

In chapter 15 of the Unicode specification is the statement that the Han
Radical-Stroke Index is available as a separate file.  I have tried to find
it on the web site with no success.  Is this file available on the web site
please?

William Overington

13 May 2002







CJK Unified Ideographs Extension B

2002-05-13 Thread William Overington

I have been looking at the characters in the CJK Unified Ideographs
Extension B document.  These are the characters from U+020000 through to
U+02A6DF, which, as I understand it, are the rarer CJK characters.

I wonder if any of the people who read this list who understand the
languages involved might please like to say what any one or two of these
characters, of their choice, mean please, just as a matter of general
cultural interest for people who see these characters in the Unicode
specification and, though not themselves knowledgeable of the languages,
find the characters interesting for their artistry and history.

William Overington

13 May 2002






Re: Private Use Surrogate Pairs (128x1024 - 4)

2002-05-13 Thread William Overington

The 128 is from all 128 possible permutations of 0 and 1 as the seven least
significant bits of any high surrogate in the range U+DB80 through to
U+DBFF.  The 1024 is from all possible permutations of the ten least
significant bits of any low surrogate in the range U+DC00 through to U+DFFF.

Consider any of the 128 high surrogates and any of the 1024 low surrogates
that are mentioned above.  Take the ten least significant bits of that high
surrogate as a number and multiply that number by 1024.  To that result, add
the ten least significant bits of the low surrogate, then add the ordinary
base 10 number 65536, which is, in binary, 1 followed by 16 zeros.  This
means that the result will be in the range of U+0F0000 through to U+10FFFF.
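
The arithmetic above can be sketched in a few lines of Python. This is only
an illustration of the steps as described in this message; the function name
decode_surrogate_pair is my own and not taken from the specification.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Ten least significant bits of the high surrogate, times 1024,
    # plus ten least significant bits of the low surrogate, plus 65536.
    return (high & 0x3FF) * 1024 + (low & 0x3FF) + 65536


# First and last private use pairs: U+DB80 U+DC00 and U+DBFF U+DFFF.
first = decode_surrogate_pair(0xDB80, 0xDC00)
last = decode_surrogate_pair(0xDBFF, 0xDFFF)
print(f"U+{first:06X} .. U+{last:06X}")
```

The two printed values are the ends of the range given above.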

Section 3.14 of chapter 13 of the Unicode specification and section 3.7 of
chapter 3 of the Unicode specification have details of the method.

When I first met with the idea of surrogates they seemed very complicated
with the addition of 65536 seeming strange.  However, I have since come to
the view that it is a very clever method of adding in sixteen extra planes
each of 65536 code points without having any of the code points which are
produced using surrogates being duplicates for code points that are already
in the basic 16 bit code plane.  Thus Unicode has seventeen planes each of
65536 code points, by having the original plane of 65536 code points
together with an additional sixteen planes each of 65536 code points.
However, two code points for each plane are unused, namely those ending
hexadecimal FFFE and hexadecimal FFFF.
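
As a rough check of the counting, here is a small Python sketch. It assumes
only what is said in this message: seventeen planes of 65536 code points, two
unused code points per plane, and 128 x 1024 private use pairs. The variable
names are mine.

```python
PLANE_SIZE = 65536      # code points per plane
PLANE_COUNT = 1 + 16    # the original plane plus sixteen more

# The two code points per plane described above as unused.
unused = [plane * PLANE_SIZE + tail
          for plane in range(PLANE_COUNT)
          for tail in (0xFFFE, 0xFFFF)]

# Private use pairs: 128 high surrogates times 1024 low surrogates,
# starting at the first code point reached by U+DB80 U+DC00.
private_use_pairs = 128 * 1024
first_private = 0xF0000
last_private = first_private + private_use_pairs - 1
unused_private = [cp for cp in unused
                  if first_private <= cp <= last_private]
usable_private = private_use_pairs - len(unused_private)

print(f"last code point: U+{PLANE_COUNT * PLANE_SIZE - 1:06X}")
print(f"usable private use code points: {usable_private}")
```

The subtraction of the four unused code points is where the "- 4" in the
subject line comes from.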

William Overington

13 May 2002






Re: To submit or not to submit

2002-05-13 Thread Michael Everson

At 10:14 +0800 2002-05-13, Amir Herman wrote:

>Dear Roozbeh,
>
>I would strongly suggest that instead of correcting the U+6AC, we
>add another glyph of 'GA' of letter 'KEHEH' (U+6A9) with dot above.
>It is not 100% wrong of saying that existing U+6AC represent 'GA'
>for old malay. Only that the glyph is in 'rare' shape and seldom
>being used in normal writing. The major problem for this glyph is it
>can't represent the character when we want to use it as a first and
>middle position of a word. With the existing 'GA' of U+6AC, we can
>only use it when 'GA' come as the last character of a word. The same
>case actually goes for 'KAF' in arabic. Although for Unicode it
>refers to U+643 as 'KAF' but in writing this glyph is seldom being
>used due of its limitation of 'shaping transformation'. Same as the
>case of U+6AC, the U+643 cannot represent the 'KAF' when it located
>at the first or middle of a word. Thats why in Arabic writings
>including Qur'an itself the letter 'KAF' mostly represent by U+6A9.

I tend to agree; it will be a lot easier on everyone if we disunify 
the new Jawi character from characters used for other purposes. At 
the Beijing meeting of WG2 a representative who spoke Jawi was on about
this, and that was some years ago. Now it's raised its head again. It
seems to me that fiddling with descriptions to indicate that the
diacritics on a character will change shape and position depending on
language is more complex than just encoding a Jawi GA.

Otherwise you're going to have to work out how to deal with
multilingual text; does this "solution" exist for other Arabic
characters?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




RE: To submit or not to submit

2002-05-13 Thread Marco Cimarosti

John Hudson [mailto:[EMAIL PROTECTED]]
> Amir, you are misunderstanding the nature of Unicode. Unicode is a 
> *character* encoding standard, and the glyphs in the charts 
> are intended only as a visual guide suggesting normative shapes
> for those characters.

On the other hand, we know that this sound principle has often been
"violated" for a variety of historical reasons.

Talking about kafs, you must agree that U+0643 (ARABIC LETTER KAF ك),
U+06A9 (ARABIC LETTER KEHEH ک), and U+06AA (ARABIC LETTER SWASH KAF ڪ) are
just glyphic variations of the same letter.

These variations could as well have been unified, and handled via smart
fonts, but Unicode decided for disunification.

On the basis of this precedent, and on the basis of the fact that sample
glyphs in old copies of the Unicode book will have a strong influence on font
designers for years, it may be wise in this case to leave the old
U+0643-like character alone and add a new U+06A9-like character.

I would say that this would be a prudent choice in the case of an
international script like Arabic: who can make an oath that no language's
orthography ever used a letter like U+0643 with a dot above?

_ Marco




RE: regarding unicode support in Oracle8i

2002-05-13 Thread Marco Cimarosti

Doug Ewell wrote:
> Michael Yau  wrote:
> 
> > In Oracle9i, the NCHAR datatypes are exclusively Unicode datatypes
> > and supports both UTF-16 and UTF-8.
> 
> s/UTF-8/CESU-8/

update ORACLE set ENCODING='CESU-8' where ENCODING='UTF-8'

_ Marco




Re: CJK Unified Ideographs Extension B

2002-05-13 Thread Ping Yeh

Take U+2A009 and U+2A00A as examples: I have never seen anyone
use either of them, yet from the radicals it seems
that both are names of some kind of bird.

Ping

William Overington wrote:

> I have been looking at the characters in the CJK Unified Ideographs
> Extension B document.  These are the characters from U+020000 through to
> U+02A6DF, which, as I understand it, are the rarer CJK characters.
> 
> I wonder if any of the people who read this list who understand the
> languages involved might please like to say what any one or two of these
> characters, of their choice, mean please, just as a matter of general
> cultural interest for people who see these characters in the Unicode
> specification and, though not themselves knowledgeable of the languages,
> find the characters interesting for their artistry and history.
> 
> William Overington
> 
> 13 May 2002
> 






Re: To submit or not to submit

2002-05-13 Thread Roozbeh Pournader


On Mon, 13 May 2002, Amir Herman wrote:

> As conclusion, I would say that we can still preserve the existing U+6AC
> because it is not wrong, only the glyph is not standard and limited in
> its use. Later on I might send some images to clarify my argument. The
> task now is to add another glyph that could present the standard and
> most common glyph of 'GA'. To view a full standard jawi alphabet please
> refer to:
> 
> http://www.omniglot.com/writing/malay.htm
> http://www.linguistsoftware.com/ljawi.htm

It seems that your references don't agree! Although both have a Keheh With 
Dot Above, one has Kaf, and another has Keheh in the list of letters.

BTW, after hearing Marco's and others' comments, and seeing the examples,
I believe that we should not touch the old character, but we should encode
a new ARABIC LETTER KEHEH WITH DOT ABOVE. I look forward to seeing a
proposal, with samples from published documents.

roozbeh





Re: Useful Resources - Another round of spring cleaning

2002-05-13 Thread rajesh


- Original Message - 
From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "Magda Danish (Unicode)" <[EMAIL PROTECTED]>
Cc: "unicode" <[EMAIL PROTECTED]>
Sent: Friday, May 10, 2002 5:11 AM
Subject: Re: Useful Resources - Another round of spring cleaning


> Google+ :
> 
> markus
> 
> Magda Danish (Unicode) wrote:
> 
> > - Akkadian   
> 
> 
> http://www.sron.nl/~jheise/akkadian/
> 
> 
> > - Multilingual Project Gutenberg
> > http://www.informatik.uni-hamburg.de/gutenb
> 
> 
> http://www.sharat.co.il/pg/
> 
> > - USMARC to UNIVERSAL CHARACTER SET MAPPINGS
> > http://lcweb.loc.gov/marc/marc2ucs.html


Above link doesn't work. I am curious to visit the site.

rajesh







Re: Han Radical-Stroke Index

2002-05-13 Thread John H. Jenkins


On Monday, May 13, 2002, at 04:02 AM, William Overington wrote:

> In chapter 15 of the Unicode specification is the statement that the Han
> Radical-Stroke Index is available as a separate file.  I have tried to
> find it on the web site with no success.  Is this file available on the
> web site please?
>
>

The current version is at .

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: CJK Unified Ideographs Extension B

2002-05-13 Thread John H. Jenkins


On Monday, May 13, 2002, at 04:21 AM, William Overington wrote:

> I have been looking at the characters in the CJK Unified Ideographs
> Extension B document.  These are the characters from U+020000 through to
> U+02A6DF, which, as I understand it, are the rarer CJK characters.
>

Actually, this is not quite true.  The vast majority are rare, of course, 
and none of them are exactly *common*, but how rare they are depends on 
what you're writing.  A small number, for example, are from HK SCS and 
reflect current needs for Hong Kong, including general-purpose Cantonese 
writing.  (One is generally not supposed to write Cantonese, even if one 
speaks it, hence the lag in getting some Cantonese-specific characters 
added.)

> I wonder if any of the people who read this list who understand the
> languages involved might please like to say what any one or two of these
> characters, of their choice, mean please, just as a matter of general
> cultural interest for people who see these characters in the Unicode
> specification and, though not themselves knowledgeable of the languages,
> find the characters interesting for their artistry and history.
>
>

My personal favorite is U+233B4, which means a tree stump.  (It's formed 
by taking the "tree" radical and moving the cross-bar to the top of the 
character instead of having it in the middle.)  U+20C43 is a 
Cantonese-specific character meaning thin or flat.

Altogether, eighteen characters from Extension B currently have
a kDefinition entry in Unihan.txt.
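
For anyone who wants to list those entries, here is a minimal Python sketch.
It assumes Unihan.txt lines are tab-separated as "U+XXXX", field name, value,
with "#" starting a comment line; check the header of your copy of the file.
The function name ext_b_definitions is made up, and the sample lines are
illustrative only: their glosses are taken from this thread, not copied from
the file.

```python
from typing import Dict, Iterable

EXT_B_FIRST, EXT_B_LAST = 0x20000, 0x2A6DF  # Extension B block


def ext_b_definitions(lines: Iterable[str]) -> Dict[int, str]:
    """Collect kDefinition values for code points in Extension B."""
    found: Dict[int, str] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) != 3 or fields[1] != "kDefinition":
            continue
        code_point = int(fields[0][2:], 16)  # drop the "U+" prefix
        if EXT_B_FIRST <= code_point <= EXT_B_LAST:
            found[code_point] = fields[2]
    return found


# Illustrative sample lines (not real file contents).
sample = [
    "# sample data for illustration",
    "U+233B4\tkDefinition\ta tree stump",
    "U+20C43\tkDefinition\tthin or flat",
    "U+4F5C\tkDefinition\tto make",
]
result = ext_b_definitions(sample)
for cp, gloss in sorted(result.items()):
    print(f"U+{cp:04X}: {gloss}")
```

With the real file one would pass the open file object instead of the sample
list.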

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: CJK Unified Ideographs Extension B

2002-05-13 Thread Thomas Chan

On Mon, 13 May 2002, William Overington wrote:

> I have been looking at the characters in the CJK Unified Ideographs
> Extension B document.  These are the characters from U+020000 through to
> U+02A6DF, which, as I understand it, are the rarer CJK characters.
> I wonder if any of the people who read this list who understand the
> languages involved might please like to say what any one or two of these
> characters, of their choice, mean please, just as a matter of general
> cultural interest for people who see these characters in the Unicode
> specification and, though not themselves knowledgeable of the languages,
> find the characters interesting for their artistry and history.

Culturally, the majority of them are really not that interesting.  Here are
ten random ones from Plane 2:

U+224D3 is an ancient form of U+4F5C, zuo4 'to make'.
U+22984 comes from a Vietnamese source--I don't have any info on it.
U+230C4 is xin1; meaning unknown.
U+24ECB is an erroneous form of U+765F, bie3 'shrivelled'.
U+25BF6 is ku3, the name of a kind of bamboo.
U+27028 is lie4 'movement of grass'.
U+28966 is qi2 'sharp'.
U+294DF is kan3, something having to do with a distorted or ugly head/face
  (I don't quite understand its definition.)
U+2A1FD is a variant form of U+6B4E, tan4 'to sigh'.
U+2A606 is xiu1; meaning unknown.


Thomas Chan
[EMAIL PROTECTED]







RE: Useful Resources

2002-05-13 Thread Magda Danish (Unicode)

Rajesh,

The correct links are now posted at
http://www.unicode.org/unicode/onlinedat/resources.html

Magda.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Monday, May 13, 2002 3:10 AM
To: Markus Scherer; Magda Danish (Unicode)
Cc: unicode
Subject: Re: Useful Resources - Another round of spring cleaning

Magda Danish (Unicode) wrote:
> 
> > - Akkadian 
> 
> 
> http://www.sron.nl/~jheise/akkadian/
> 
> 
> > - Multilingual Project Gutenberg 
> > http://www.informatik.uni-hamburg.de/gutenb
> 
> 
> http://www.sharat.co.il/pg/
> 
> > - USMARC to UNIVERSAL CHARACTER SET MAPPINGS 
> > http://lcweb.loc.gov/marc/marc2ucs.html


Above link doesn't work. I am curious to visit the site.

rajesh







RE: on U+7384 (was Re: Synthetic scripts (was: Re: Private Use Agreements)

2002-05-13 Thread Richard Kunst

> On Friday, May 10, 2002, at 06:29 PM, John Cowan wrote:
>
> > What is this about Qing taboo characters?  Can someone point me to an
> > explanation (in English)?  Thanks.

One source is Charles S. Gardner _Chinese Traditional Historiography_,
(Cambridge, Mass.: Harvard University Press, 1938, 2nd printing, 1961), pp.
82-84.

I don't know how easy this book is to come by, so I put the relevant pages
on the Web here:

http://www.humancomp.org/misc/Gardner_82-85.pdf
or
http://www.humancomp.org/misc/Gardner_82-83.gif
http://www.humancomp.org/misc/Gardner_84-85.gif

Gardner (p.84, n.12) refers to other published works in Chinese, French, and
German. In case you are curious, the odd romanization Gardner uses is a sui
generis Wade-Giles, modified according to his own proposals back in the
30's.

He specifically discusses the case of xuán U+7384/U+248E5, which Thomas Chan
cited, since it is the tabooed character with by far the most far-reaching
impact, since it is a very common character, especially, as Gardner notes,
in Daoist texts. More often than substituting U+248E5, Qing texts are likely
to use yuán U+5143 instead.

An odd convention in China arose, perhaps as early as the late Han-early
Tang period (ca. 0-700 C.E.), in which as a general courtesy to the literate
public, emperors were given personal names which used relatively rare
characters, so that common characters didn't have to be tabooed by millions
of people. The problem with xuán U+7384 was that, maybe because they were
not-completely sinified Manchus, the Kangxi emperor's parents (or whoever
chose his personal name), didn't adhere to this courtesy, and to further the
inconvenience, the Qing dynasty in general and Kangxi's reign in particular
were a period of tremendous activity in book publishing. Someone once
observed that well over half of all extant texts published in the world
before 1700(?) are in Chinese.

On Saturday, May 11, 2002 7:27 PM, John H. Jenkins wrote:

> The whole idea of "taboo" forms stems from the fact that there are
> certain ideographs one could not use because, typically, they're part
> of personal name of someone important.  So one deliberately distorts
> them when writing them.

"Someone important" includes your own lineal ancestors too. E.g., you would
not write, or sometimes even speak, the personal given name of your father,
or grandfather.

> Such a thing is very much time-bound.  Using a character from the
> personal name of the *current* emperor is a big deal, but using one
> from the personal name of an emperor five hundred years dead from an
> entirely different dynasty is no biggie.  So the Qing dictionary, the
> KangXi, would have some taboo forms which would later become untaboo
> (especially now, of course, since nobody does that kind of thing
> anymore).

The taboo on Confucius's given name had enough currency during the May 4th
period (1920's) and the Cultural Revolution (late 1960's), that people
relished breaking it by referring to him as Kong Qiu.

I think it is likely that even today, especially in rural areas, there are
people who honor the taboo on writing and speaking personal names
of ancestors.

Rick Kunst

_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
The Humanities Computing Laboratory
A Nonprofit Education and Research Corporation
301 W. Main St., Suite 400-I
Durham, NC 27701 USA
Tel. (919) 667-9556, (919) 656-5915
Fax: (919) 667-9556
E-mail: [EMAIL PROTECTED]
http://www.humancomp.org
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/






RE: To submit or not to submit

2002-05-13 Thread John Hudson

At 04:03 5/13/2002, Marco Cimarosti wrote:

>On the basis of this precedent, and on the basis of the fact that sample
>glyphs in old copies of the Unicode book will have a strong influence on font
>designers for years, it may be wise in this case to leave the old
>U+0643-like character alone and add a new U+06A9-like character.

There are only a handful of Arabic fonts that support U+06AC, and the 
developers I have heard from, directly and indirectly, are aware that the 
form shown in the book is *rare* in Jawi text. Note that, as Amir pointed 
out, the form is not incorrect, it is simply much less common than the 
other form.

I wouldn't worry about the form of the glyphs to render this character. 
This is a font issue, and at the end of the day, if a user doesn't like 
what he sees he can go in search of a different font.

I would be much more concerned about the existence of text that uses U+06AC 
for this Jawi character. The Unicode standard very clearly identifies this 
character as 'Old Malay' and, as Ken pointed out, this is the intention of 
the standard: that U+06AC be used for the Jawi ga. Do you really want to 
turn around and tell developers that no, they should now start using a 
different codepoint for this character? Fixing a couple of composite glyphs 
in a font is much easier than worrying about whether U+06AC in a text 
string might need to be cross-mapped to a new character.

>I would say that this would be a prudent choice in the case of an
>international script like Arabic: who can make an oath that no language's
>orthography ever used a letter like U+0643 with a dot above?

True enough, but that language was not 'Old Malay'. If you want to maintain 
a character for U+0643 with dot above, I would recommend adding *this* as 
the new character, and more clearly identify U+06AC as the preferred Jawi 
form. Personally, I would wait until someone provided evidence of the use 
of such a character, rather than making assumptions. Font developers 
certainly won't thank you for encoding arbitrary characters requiring the 
design of glyphs that will never be used: it is a waste of our time and of 
our clients' money.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

If meaning is inherently public and rule-governed, then the
fact that I can't read 'Treasure Island' without visualising
Long John Silver as a one-legged version of my grandmother
is of interest only to my psychotherapist and myself.
   Terry Eagleton





RE: Useful Resources

2002-05-13 Thread Smith,Gary

It may be helpful to know that USMARC is now known as MARC 21.

MARC 21 Specifications (including mapping tables to UCS/Unicode)
http://www.loc.gov/marc/specifications/spechome.html

Gary L. Smith
Software Architect
Database Quality & Enrichment Department
OCLC
[EMAIL PROTECTED]


-Original Message-
From: Magda Danish (Unicode) [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 13, 2002 13:00 
To: [EMAIL PROTECTED]
Cc: unicode
Subject: RE: Useful Resources 


Rajesh,

The correct links are now posted at
http://www.unicode.org/unicode/onlinedat/resources.html

Magda.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Monday, May 13, 2002 3:10 AM
To: Markus Scherer; Magda Danish (Unicode)
Cc: unicode
Subject: Re: Useful Resources - Another round of spring cleaning

Magda Danish (Unicode) wrote:
> 
> > - Akkadian 
> 
> 
> http://www.sron.nl/~jheise/akkadian/
> 
> 
> > - Multilingual Project Gutenberg 
> > http://www.informatik.uni-hamburg.de/gutenb
> 
> 
> http://www.sharat.co.il/pg/
> 
> > - USMARC to UNIVERSAL CHARACTER SET MAPPINGS 
> > http://lcweb.loc.gov/marc/marc2ucs.html


Above link doesn't work. I am curious to visit the site.

rajesh







about JIS x0213

2002-05-13 Thread Yung-Fong Tang

Dear unicoder:

Could someone tell me the story behind JIS X 0213 and how we can encode
it? Is there a spec describing how to encode JIS X 0213 in SJIS,
ISO-2022-JP or EUC-JP?

Thanks.






Normalization forms

2002-05-13 Thread Lars Marius Garshol


I have been reading the Unicode Normalization UTR and have a couple of
questions regarding it:

 - will string comparison methods based on NFC and NFD always give the
   same results? 

 - is it correct that methods based on NFKC and NFKD will give
   different results from ones based on NFC/NFD?

 - if NFC and NFD give the same results, why are both specified? Why
   would an implementation choose one over the other?

 - NFKC/NFKD seem to lose significant information; in what contexts
   are they intended to be used?

-- 
Lars Marius Garshol, Ontopian         http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC        http://www.garshol.priv.no





Re: Normalization forms

2002-05-13 Thread John Cowan

Lars Marius Garshol scripsit:

>  - will string comparison methods based on NFC and NFD always give the
>same results? 

By intention, yes.

>  - is it correct that methods based on NFKC and NFKD will give
>different results from ones based on NFC/NFD?

Yes.

>  - if NFC and NFD give the same results, why are both specified? Why
>would an implementation choose one over the other?

Originally, only NFD was given, as it is sufficient.  However, text
converted from non-Unicode encodings is generally already in NFC,
so specifying NFC (which is conceptually NFD with a post-processing
pass to re-create certain precomposed characters) has certain practical
advantages.  In particular, if you are doing "early normalization",
near the point of creation, then NFC allows easy step-down to
non-Unicode encodings.
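
The step-down point can be illustrated with a short Python sketch using the
standard unicodedata module. Latin-1 stands in here for "a non-Unicode
encoding"; that choice is an assumption made for the example only.

```python
import unicodedata

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE

# NFC recomposes what NFD takes apart, and vice versa.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# The NFC form converts directly to Latin-1 ...
latin1_bytes = nfc.encode("latin-1")

# ... but the NFD form does not, because Latin-1 has no combining accent.
try:
    nfd.encode("latin-1")
    nfd_converts = True
except UnicodeEncodeError:
    nfd_converts = False
print(latin1_bytes, nfd_converts)
```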

>  - NFKC/NFKD seem to lose significant information; in what contexts
>are they intended to be used?

Compatibility distinctions may or may not be important in particular
cases: often they represent distinctions that are merely historical.
One context where compatibility distinctions are typically unimportant
is in identifiers.

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_




RE: Normalization forms

2002-05-13 Thread Addison Phillips [wM]

Hi Lars,

Some information below...

Addison

Addison P. Phillips
Globalization Architect
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Lars Marius Garshol
> Sent: Monday, May 13, 2002 1:38 PM
> To: [EMAIL PROTECTED]
> Subject: Normalization forms
>
>
>
> I have been reading the Unicode Normalization UTR and have a couple of
> questions regarding it:
>
>  - will string comparison methods based on NFC and NFD always give the
>same results?

The same results compared to what? If you mean:

if {C}=={c} then {D}=={d}, then the answer is yes.

If you mean:

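if {C} == {c} then {C} == {d}, then the answer is no. A string in Form C
does not in general match the same text in Form D code point for code
point, so the forms are not interchangeable.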
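
Both cases can be seen in a small Python sketch using the standard
unicodedata module. The helper name equal_under is made up for this example.

```python
import unicodedata


def equal_under(form: str, a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
combining = "A\u030A"    # A + COMBINING RING ABOVE

# Same form on both sides: C and D agree with each other.
same_c = equal_under("NFC", precomposed, combining)
same_d = equal_under("NFD", precomposed, combining)

# Mixed forms: a Form C string against a Form D string does not match.
mixed = (unicodedata.normalize("NFC", precomposed)
         == unicodedata.normalize("NFD", combining))

print(same_c, same_d, mixed)
```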

>
>  - is it correct that methods based on NFKC and NFKD will give
>different results from ones based on NFC/NFD?

Yes. Emphatically. For example:

U+FF21 is U+FF21 in form C and does not equal U+0041.

but:

U+FF21 in Form KC becomes U+0041...
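
In Python, using the standard unicodedata module, that example looks roughly
like this (a sketch, not a statement about any particular product):

```python
import unicodedata

fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A

form_c = unicodedata.normalize("NFC", fullwidth_a)
form_kc = unicodedata.normalize("NFKC", fullwidth_a)

# Form C leaves the fullwidth letter alone; Form KC folds it to U+0041.
print(f"C:  U+{ord(form_c):04X}")
print(f"KC: U+{ord(form_kc):04X}")
```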

>
>  - if NFC and NFD give the same results, why are both specified? Why
>would an implementation choose one over the other?

Again the question is what you mean by "results". The composed form is
actually different than the decomposed one. It is generally more compatible
with what naive rendering software expects. The decomposed form, by
comparison, makes certain kinds of processing more efficient (for example,
certain kinds of collation processing).

>
>  - NFKC/NFKD seem to lose significant information; in what contexts
>are they intended to be used?

They have a number of useful contexts. Namespaces are one. Generally
speaking, the vast majority of characters unified by the compatibility forms
are rendering differences (such as half-width forms, super/sub scripts, and
the like) which make trouble in restricted namespaces (such as programming
identifiers, domain names, and the like). In addition, it is often possible
to extract more meaning from data input fields by applying K forms.

For example, in some of the webMethods tools GUIs, strings that do not parse
successfully as numbers on the first pass are normalized to Form KC (except for
super/subscripts) in order to improve parsing success.
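
A rough Python sketch of that idea follows. It is not the webMethods code;
the function name parse_number is invented, and treating any character with
a <super> or <sub> compatibility decomposition as a reason to give up is just
one possible reading of "except for super/subscripts".

```python
import unicodedata


def parse_number(text: str) -> float:
    """Parse a number, retrying once after Form KC normalization."""
    try:
        return float(text)          # first pass, no normalization
    except ValueError:
        pass
    # Leave superscripts and subscripts alone: folding them to plain
    # digits would change the meaning of the input.
    for ch in text:
        if unicodedata.decomposition(ch).startswith(("<super>", "<sub>")):
            raise ValueError(f"cannot parse {text!r} as a number")
    return float(unicodedata.normalize("NFKC", text))


# Fullwidth digits with a fullwidth full stop, which the first pass
# may reject but which Form KC folds to "12.5".
print(parse_number("\uFF11\uFF12\uFF0E\uFF15"))
```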

>
> --
> Lars Marius Garshol, Ontopian         http://www.ontopia.net
> ISO SC34/WG3, OASIS GeoLang TC        http://www.garshol.priv.no
>
>
>