I don't think the "fold to base" is as useful as some other information. For those characters with a canonical decomposition, the decomposition carries more information, since you can combine it with a "remove combining marks" folding to get the folding to base.
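To make that combination concrete, here is a minimal Python sketch. It assumes the standard unicodedata module and treats every character whose general category starts with M as a combining mark; the function name fold_to_base is made up for illustration and is not taken from TR30.

```python
import unicodedata


def fold_to_base(text: str) -> str:
    """Canonically decompose, then drop combining marks (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    for sample in ("\u00e9", "\u00c5", "\u1ebf", "\u00d8"):
        print(f"U+{ord(sample):04X} -> {fold_to_base(sample)!r}")
```

Characters without a canonical decomposition, such as U+00D8, pass through this unchanged, because there is no mark to remove after NFD.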
For my part, what would be more interesting would be a "full" decomposition of the characters that don't have a canonical decomposition, e.g.

LATIN CAPITAL LETTER O WITH STROKE => O + /

BTW, I had posted some commentary on TR30, which I will repeat here.

...

I found these files almost impossible to assess in code point form, so I ran them through a quick ICU transform to add comments with the real characters and names. I also NFC'd the forms, just for consistency. The files generated from Asmus's are in http://www.macchiato.com/utc/tr30/. I'd suggest posting them in this form for public review of the TR, since others will have the same difficulty in assessing the quality of the data.

Here are some quick comments.

http://www.macchiato.com/utc/tr30/HiraganaFolding-new.txt

Adding digraph expansions seems quite odd.

http://www.macchiato.com/utc/tr30/KatakanaFolding-new.txt

When in NFC, whole batches of these mappings are NOPs. I don't know why they are there; they are also not consistent in the use of composed vs. decomposed forms.

This file combines in half-width katakana folding. I think it is much more useful if that is separated out. Someone can apply a sequence of two transforms if they want both.

http://www.macchiato.com/utc/tr30/SuperscriptFolding-new.txt

This feels like a real potpourri of stuff. Why superscripts and not subscripts? Why annotation characters? Why modifier letters? Those are not really superscripts. Waw?

http://www.macchiato.com/utc/tr30/WidthFolding-new.txt

This file would be MUCH more useful if split into two separate files:

  Full-width to half-width
  Half-width to full-width

Again, remove the NFC mappings.

27E6; 301A # ⟦ → 〚 MATHEMATICAL LEFT WHITE SQUARE BRACKET → LEFT WHITE SQUARE BRACKET

These don't appear to be a width issue.

Note that I have not checked these new data tables for completeness; these were just some quick observations.
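The NFC remarks on the katakana and width files can be checked mechanically. The following Python sketch flags mappings that turn into NOPs once both sides are in NFC. It assumes each data line has the shape "source; target # comment" with space-separated hex code points, as in the 27E6; 301A line quoted above; the helper names parse_mapping and is_nfc_noop are invented for illustration, and the actual TR30 files may have more fields.

```python
import unicodedata


def parse_mapping(line: str):
    """Parse 'XXXX [XXXX ...]; YYYY [YYYY ...] # comment' into (source, target) strings.

    Returns None for blank or comment-only lines. The field layout is an assumption.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2:
        return None

    def to_text(field: str) -> str:
        return "".join(chr(int(cp, 16)) for cp in field.split())

    return to_text(fields[0]), to_text(fields[1])


def is_nfc_noop(source: str, target: str) -> bool:
    """True if the mapping does nothing once both sides are normalized to NFC."""
    return unicodedata.normalize("NFC", source) == unicodedata.normalize("NFC", target)


if __name__ == "__main__":
    samples = [
        "30AB 3099; 30AC # KA + combining voiced sound mark -> GA",
        "27E6; 301A # mathematical bracket -> CJK bracket",
    ]
    for line in samples:
        parsed = parse_mapping(line)
        if parsed is not None:
            print(line.split("#")[0].strip(), "NOP under NFC:", is_nfc_noop(*parsed))
```

Lines reported as NOPs are candidates for removal if the folding is specified to operate on NFC text.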
Mark
__________________________________
http://www.macchiato.com

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tue, 2004 May 25 14:57
Subject: Re: New Public Review Issue posted

> Rick McGowan scripsit:
>
> > The Unicode Technical Committee has posted a new issue for public
> > review and comment. Details are on the following web page:
> >
> > http://www.unicode.org/review/
>
> I have prepared a draft DiacriticFolding.txt file for this issue; it is
> temporarily available at http://www.ccil.org/~cowan/DiacriticFolding.txt .
> This was prepared by looking for lines in UnicodeData that matched
> the regex '(GREEK|LATIN|CYRILLIC|HEBREW).*WITH'. (I added Hebrew to the
> set of scripts specified by the current draft of #30.)
>
> Characters with decompositions were mapped into the base character of the
> decomposition; characters without decompositions were mapped by name.
> The file http://www.ccil.org/~cowan/DiacriticFoldingExceptions.txt contains
> a list of 32 characters matching the pattern which did not seem to me
> to be suitable for diacritic folding.
>
> I have posted a short version of this note to the Unicode comment form.
>
> Comments?
>
> --
> A rabbi whose congregation doesn't want          John Cowan
> to drive him out of town isn't a rabbi,          http://www.ccil.org/~cowan
> and a rabbi who lets them do it                  [EMAIL PROTECTED]
> isn't a man.    --Jewish saying                  http://www.reutershealth.com
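For readers who want to reproduce something like the procedure in the quoted message, here is a rough Python sketch. It is not John Cowan's actual script. It assumes Python's unicodedata names stand in for the UnicodeData lines, that only canonical decompositions count as "decompositions", and that "mapped by name" means looking up the character whose name is the part before " WITH "; all three are guesses, and the function names are invented for illustration.

```python
import re
import unicodedata

# Regex taken verbatim from the quoted message.
PATTERN = re.compile(r"(GREEK|LATIN|CYRILLIC|HEBREW).*WITH")


def base_of_decomposition(ch: str):
    """First character of the full canonical decomposition, or None if there is none."""
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):  # "<...>" marks a compatibility mapping
        return None
    return unicodedata.normalize("NFD", ch)[0]


def base_by_name(ch: str):
    """Guess the base by dropping ' WITH ...' from the name (an assumed heuristic)."""
    name = unicodedata.name(ch, "")
    try:
        return unicodedata.lookup(name.split(" WITH ", 1)[0])
    except KeyError:
        return None


def fold_one(ch: str):
    """Return the folded base for ch, or None if ch is not a candidate."""
    if not PATTERN.search(unicodedata.name(ch, "")):
        return None
    target = base_of_decomposition(ch) or base_by_name(ch)
    return target if target and target != ch else None


def diacritic_fold_candidates():
    """Yield (character, base) pairs over the whole code space."""
    for cp in range(0x110000):
        target = fold_one(chr(cp))
        if target is not None:
            yield chr(cp), target


if __name__ == "__main__":
    for ch in ("\u00e9", "\u00d8", "\u03ac"):
        print(f"{ord(ch):04X}; {ord(fold_one(ch)):04X} # {unicodedata.name(ch)}")
```

A name-based guess of this kind will produce some unsuitable mappings, which is presumably why the quoted message keeps a separate exceptions file.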