I don't think the "fold to base" is as useful as some other information. For
those characters with a canonical decomposition, the decomposition carries
more information, since you can combine it with a "remove combining marks"
folding to get the folding to base.
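
To make that concrete, here is a minimal Python sketch of getting the folding
to base by chaining the two steps. The function names are mine, and treating
every character in a Unicode "M" category as a combining mark is an assumption
about what that folding would remove, not anything taken from the TR.

```python
import unicodedata

def remove_combining_marks(text):
    # Assumption: "combining mark" means any character whose general
    # category starts with "M" (Mn, Mc, Me).
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("M"))

def fold_to_base(text):
    # Step 1: canonical decomposition (NFD).
    # Step 2: drop the combining marks that the decomposition exposed.
    return remove_combining_marks(unicodedata.normalize("NFD", text))
```

Characters with no canonical decomposition pass through unchanged, so the
decomposition step loses nothing that a direct fold-to-base table would give.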

For my part, what would be more interesting would be a "full" decomposition of
the characters that don't have a canonical decomposition, e.g.

LATIN CAPITAL LETTER O WITH STROKE => O + /
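
As a rough sketch of what I mean: the extra table below is purely
hypothetical (one entry, copied from the example above), and it is simply
layered on top of ordinary NFD.

```python
import unicodedata

# Hypothetical supplementary table for characters that have no canonical
# decomposition. The single entry mirrors the "O + /" example; a real
# table would have to come from reviewed data.
EXTRA_DECOMPOSITIONS = {
    "\u00D8": "O/",  # LATIN CAPITAL LETTER O WITH STROKE
}

def full_decompose(text):
    # Apply the canonical decomposition first, then the extra table.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(EXTRA_DECOMPOSITIONS.get(ch, ch) for ch in decomposed)
```

Note that unicodedata.decomposition("\u00D8") returns an empty string, which
is exactly the gap such a table would fill.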

BTW, I had posted some commentary on TR30, which I will repeat here.

... I found these files almost
impossible to assess in code point form, so I ran them through a quick ICU
transform to add comments with the real characters and names. I also NFC'd the
forms, just for consistency. These files generated from Asmus's are in
http://www.macchiato.com/utc/tr30/.

I had suggested posting them in this form for public review of the TR, since
others will have the same difficulty in assessing the quality of the data.

Here are some quick comments.

http://www.macchiato.com/utc/tr30/HiraganaFolding-new.txt

Adding digraph expansions seems quite odd.

http://www.macchiato.com/utc/tr30/KatakanaFolding-new.txt

When in NFC, whole batches of these mappings are NOPs. Don't know why they are
there; they are also not consistent in the use of composed vs. decomposed forms.
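
A check along these lines would flag such mappings. It is only a sketch: it
assumes each data line has the form "source code points; target code points
# comment", and the helper names are mine.

```python
import unicodedata

def parse_mapping(line):
    # Strip the trailing comment, then split "source; target".
    data = line.split("#", 1)[0]
    source, target = data.split(";")

    def to_text(field):
        # A field is one or more space-separated hex code points.
        return "".join(chr(int(cp, 16)) for cp in field.split())

    return to_text(source), to_text(target)

def is_nfc_noop(line):
    # A mapping does nothing on NFC text if both sides normalize
    # to the same string.
    source, target = parse_mapping(line)
    return (unicodedata.normalize("NFC", source)
            == unicodedata.normalize("NFC", target))
```

For instance, a line mapping 30AB 3099 to 30AC (KATAKANA LETTER GA) is a NOP
by this test, because NFC already composes the source into the target.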

This file combines half-width katakana folding. I think it is much more useful
if that is separated out. Someone can apply a sequence of two transforms if they
want both.

http://www.macchiato.com/utc/tr30/SuperscriptFolding-new.txt

This feels like a real potpourri of stuff. Why superscripts and not subscripts?
Why annotation characters? Why modifier letters -- those are not really
superscripts. Waw?

http://www.macchiato.com/utc/tr30/WidthFolding-new.txt

This file would be MUCH more useful if in two separate files.

Full-width to half-width
Half-width to full-width

Again, remove the NFC mappings.

27E6; 301A # ⟦ → 〚 MATHEMATICAL LEFT WHITE SQUARE BRACKET → LEFT WHITE SQUARE
BRACKET

These don't appear to be a width issue.
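
One way to split the file, and to catch stray entries like that one, is to
look at the compatibility decomposition tag of the source character. This is
a sketch under my own assumptions: the labels are made up, and relying on the
<wide> and <narrow> tags is a guess at what "width" should mean here.

```python
import unicodedata

def width_direction(code_point):
    # Full-width forms carry a <wide> compatibility decomposition and
    # half-width forms carry <narrow>; anything else is treated as not
    # being a width mapping at all.
    decomp = unicodedata.decomposition(chr(code_point))
    if decomp.startswith("<wide>"):
        return "full-to-half"
    if decomp.startswith("<narrow>"):
        return "half-to-full"
    return "other"
```

By this test FF21 (FULLWIDTH LATIN CAPITAL LETTER A) lands in the first
file, FF76 (HALFWIDTH KATAKANA LETTER KA) in the second, and 27E6 in neither.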

Note that I have not checked these new data tables for completeness; these were
just some quick observations.


Mark
__________________________________
http://www.macchiato.com

----- Original Message ----- 
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tue, 2004 May 25 14:57
Subject: Re: New Public Review Issue posted


> Rick McGowan scripsit:
> > The Unicode Technical Committee has posted a new issue for public
> > review and comment. Details are on the following web page:
> >
> > http://www.unicode.org/review/
>
> I have prepared a draft DiacriticFolding.txt file for this issue; it is
> temporarily available at http://www.ccil.org/~cowan/DiacriticFolding.txt .
> This was prepared by looking for lines in UnicodeData that matched
> the regex '(GREEK|LATIN|CYRILLIC|HEBREW).*WITH'.  (I added Hebrew to the
> set of scripts specified by the current draft of #30.)
>
> Characters with decompositions were mapped into the base character of the
> decomposition; characters without decompositions were mapped by name.
> The file http://www.ccil.org/~cowan/DiacriticFoldingExceptions.txt contains
> a list of 32 characters matching the pattern which did not seem to me
> to be suitable for diacritic folding.
>
> I have posted a short version of this note to the Unicode comment form.
>
> Comments?
>
> -- 
> A rabbi whose congregation doesn't want         John Cowan
> to drive him out of town isn't a rabbi,         http://www.ccil.org/~cowan
> and a rabbi who lets them do it                 [EMAIL PROTECTED]
> isn't a man.    --Jewish saying                 http://www.reutershealth.com
>
>

