Re: Normalization forms

2002-05-13 Thread John Cowan

Lars Marius Garshol scripsit:

  - will string comparison methods based on NFC and NFD always give the
same results? 

By intention, yes.

  - is it correct that methods based on NFKC and NFKD will give
different results from ones based on NFC/NFD?

Yes.

  - if NFC and NFD give the same results, why are both specified? Why
would an implementation choose one over the other?

Originally, only NFD was given, as it is sufficient.  However, text
converted from non-Unicode encodings is generally already in NFC,
so specifying NFC (which is conceptually NFD with a post-processing
pass to re-create certain precomposed characters) has certain practical
advantages.  In particular, if you are doing early normalization,
near the point of creation, then NFC allows easy step-down to
non-Unicode encodings.
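
To make the step-down point concrete, here is a minimal sketch in Python, using the standard unicodedata module. The helper name to_latin1 and the choice of ISO 8859-1 as the legacy encoding are only illustrative assumptions, not anything the UTR prescribes.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (a decomposed sequence).
decomposed = "e\u0301"

nfd = unicodedata.normalize("NFD", decomposed)  # stays as two code points
nfc = unicodedata.normalize("NFC", decomposed)  # recomposed to U+00E9

def to_latin1(text):
    """Hypothetical helper: normalize to NFC, then encode as ISO 8859-1.

    Raises UnicodeEncodeError if a character has no Latin-1 equivalent.
    """
    return unicodedata.normalize("NFC", text).encode("latin-1")

# NFC of the NFD form gives the same string as NFC of the original,
# which is the "NFD plus a recomposition pass" view of NFC.
same = unicodedata.normalize("NFC", nfd) == nfc
```

Encoding the NFD string directly to Latin-1 fails, because the combining accent has no code point there, whereas the NFC string encodes to the single byte 0xE9.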

  - NFKC/NFKD seem to lose significant information; in what contexts
are they intended to be used?

Compatibility distinctions may or may not be important in particular
cases: often they represent distinctions that are merely historical.
One context where compatibility distinctions are typically unimportant
is in identifiers.
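
As an illustration of the identifier case, here is a small Python sketch. The function identifier_key and the symbol table around it are hypothetical names chosen for the example; the only real API used is unicodedata.normalize.

```python
import unicodedata

def identifier_key(name):
    """Hypothetical helper: fold compatibility variants with NFKC so that
    visually equivalent spellings map to one symbol-table key."""
    return unicodedata.normalize("NFKC", name)

# A toy symbol table keyed by the folded form.
symbols = {}
symbols[identifier_key("file")] = "plain ASCII spelling"

# "\ufb01le" starts with U+FB01 LATIN SMALL LIGATURE FI.
ligature_spelling = "\ufb01le"
found = identifier_key(ligature_spelling) in symbols
```

Under NFC the two spellings would remain distinct keys, since the ligature has only a compatibility decomposition.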

-- 
John Cowan [EMAIL PROTECTED] http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_




RE: Normalization forms

2002-05-13 Thread Addison Phillips [wM]

Hi Lars,

Some information below...

Addison

Addison P. Phillips
Globalization Architect
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature.

 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Lars Marius Garshol
 Sent: Monday, May 13, 2002 1:38 PM
 To: [EMAIL PROTECTED]
 Subject: Normalization forms



 I have been reading the Unicode Normalization UTR and have a couple of
 questions regarding it:

  - will string comparison methods based on NFC and NFD always give the
same results?

The same results compared to what? If you mean:

if {C} == {c} then {D} == {d} (that is, two strings that are equal in form C
are also equal in form D), then the answer is yes.

If you mean:

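if {C} == {c} then {C} == {d}, then the answer is no. The forms are not
interchangeable: a string in form C is not guaranteed to compare equal to
the same text in form D, so both sides must be normalized to the same form
before comparing.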
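
A short Python sketch of the two cases, assuming the standard unicodedata module. The function name canonically_equal is made up for the example.

```python
import unicodedata

def canonically_equal(a, b, form="NFC"):
    """Compare two strings after normalizing BOTH to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00c5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"    # "A" followed by COMBINING RING ABOVE

# Case 1: same form on both sides. NFC and NFD agree on the answer.
equal_via_nfc = canonically_equal(composed, decomposed, "NFC")
equal_via_nfd = canonically_equal(composed, decomposed, "NFD")

# Case 2: mixing forms. A form C string against a form D string.
mixed = (unicodedata.normalize("NFC", composed)
         == unicodedata.normalize("NFD", decomposed))
```

Here equal_via_nfc and equal_via_nfd are both true, while mixed is false, because the two sides are different code point sequences.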


  - is it correct that methods based on NFKC and NFKD will give
different results from ones based on NFC/NFD?

Yes. Emphatically. For example:

U+FF21 is U+FF21 in form C and does not equal U+0041.

but:

U+FF21 in Form KC becomes U+0041...
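
The same example as a quick check in Python (standard unicodedata module; the variable names are just for illustration):

```python
import unicodedata

fullwidth_a = "\uff21"   # FULLWIDTH LATIN CAPITAL LETTER A

in_form_c = unicodedata.normalize("NFC", fullwidth_a)    # unchanged
in_form_kc = unicodedata.normalize("NFKC", fullwidth_a)  # folded to "A"

c_matches_ascii = in_form_c == "A"
kc_matches_ascii = in_form_kc == "A"
```

c_matches_ascii is false and kc_matches_ascii is true, so a comparison built on the K forms treats the two characters as the same while one built on C or D does not.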


  - if NFC and NFD give the same results, why are both specified? Why
would an implementation choose one over the other?

Again, the question is what you mean by results. The composed form is
actually a different code point sequence from the decomposed one. The
composed form is generally more compatible with what naive rendering
software expects. The decomposed form, by comparison, makes certain kinds
of processing more efficient (for example, certain kinds of collation
processing).
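
One simple case where the decomposed form is convenient, sketched in Python: once text is in form D, accents are separate combining characters and can be skipped to get a rough base-letter key. This is only an illustration under that assumption (the name base_letters is made up), and it is not a substitute for a real collation algorithm.

```python
import unicodedata

def base_letters(text):
    """Return text with combining marks removed, working from form D.

    unicodedata.combining() returns a non-zero combining class for
    combining marks, so those characters are dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

precomposed_word = "r\u00e9sum\u00e9"     # uses U+00E9
decomposed_word = "re\u0301sume\u0301"    # uses "e" + U+0301
```

Both spellings give the same key, because the function normalizes to form D before looking at individual characters.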


  - NFKC/NFKD seem to lose significant information; in what contexts
are they intended to be used?

They have a number of useful contexts. Namespaces are one. Generally
speaking, the vast majority of characters unified by the compatibility forms
differ only in rendering (such as half-width forms, super/sub scripts, and
the like), and such differences make trouble in restricted namespaces (such
as programming identifiers, domain names, and the like). In addition, it is
often possible to extract more meaning from data input fields by applying
K forms.

For example, in some of the webMethods tools GUIs, strings that do not parse
successfully as numbers on the first pass are normalized to Form KC (except
for super/subscripts) in order to improve parsing success.
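
To give a flavor of the approach, here is a rough Python sketch. It is not the webMethods code: the function names, the use of float() as the parser, and the character-by-character folding are all assumptions made for the example.

```python
import unicodedata

def is_super_or_sub(ch):
    """True if ch has a <super> or <sub> compatibility decomposition."""
    return unicodedata.decomposition(ch).startswith(("<super>", "<sub>"))

def fold_for_parsing(text):
    """Apply NFKC to each character, but leave super/subscripts alone.

    Plain NFKC would turn "10" + SUPERSCRIPT TWO into "102", which
    changes the meaning of the input.
    """
    return "".join(
        ch if is_super_or_sub(ch) else unicodedata.normalize("NFKC", ch)
        for ch in text
    )

def parse_number(text):
    """Try a normal parse first; fall back to the folded text."""
    try:
        return float(text)
    except ValueError:
        return float(fold_for_parsing(text))  # may still raise ValueError

# Fullwidth digits with U+FF0E FULLWIDTH FULL STOP as the separator.
fullwidth_input = "\uff11\uff12\uff0e\uff15"
```

With this sketch, parse_number(fullwidth_input) succeeds on the second pass, while a string such as "10" followed by U+00B2 is still rejected rather than silently read as 102.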


 --
 Lars Marius Garshol, Ontopian         URL: http://www.ontopia.net
 ISO SC34/WG3, OASIS GeoLang TC        URL: http://www.garshol.priv.no