Thanks for your long and detailed reply, Richard. (The full version came to me directly, so I could see it.) It will take me some time to digest it, but since you suggest I submit something to the UTC, I want to clarify the extent of my knowledge of, and ongoing involvement with, Han characters.
I started out life as a linguist but have worked in software for the past 20 years. My main work now involves web crawling and page and entity identification, focusing strongly on English-language sources. I ran into the issue I've described on this mailing list while doing a personal project correlating Sino-Japanese and Sino-Korean vocabulary. I am actually more interested in the readings of the characters (Japanese On and Korean Eum) than in the characters themselves, but I am trying to leverage the fact that the two languages normally write cognate Sino-X words with the same characters (plus or minus variations in form).

I wanted stroke counts of the form Total-Radical-Residual, such as I am used to from Jack Halpern's Japanese character dictionaries, and was fully confident that I could produce them by simply combining the kTotalStrokes and kRSUnicode fields in the manner I indicated in my message. But I started noticing occasional examples where the implied radical stroke counts seemed too large or too small, so I modified a tool I already had to detect such cases algorithmically (I've sketched the kind of check it does in a P.S. below). I don't know Chinese, and I have a lot of trouble making out Chinese characters printed at normal size, due to lack of familiarity made worse by poor eyesight and a touch of dyslexia.

I am certainly willing to give the UTC a complete list of characters (through Extension D) and their status as suspicious or not, along with some of the stats the tool uses to make its decisions. In fact, that I already have. Beyond that, I might be able to commit to submitting a list of kTotalStrokes values that should be corrected to match the kRSUnicode values. I definitely do not have either the time or the knowledge to determine the correctness of the kRSUnicode values themselves, or to do anything with variants. And I'm not sure I am the best person to do this anyway: based on the information about CDL on the Wenlin Institute website, I sense you already have a full compositional model and could use it to produce a list of corrections far more accurate than anything I could manage.

As for changing or adding fields: while I think the original separation of kTotalStrokes and kRSUnicode was a poor design choice (though maybe unavoidable for historical reasons), I'm thinking more and more that it's not worth making a change just to fix the issue I'm raising, and that a better next step would be to represent characters as specific, formally recognized radical variants (with fixed stroke counts) plus residual stroke counts (see the second sketch in the P.P.S. below). This would be a first step toward a compositional model, but could be done without getting into all the complexity and difficulty of a full recursive model. What do you think? Feel free to respond off-list if you prefer.

-- John
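P.S. In case it helps to make this concrete, here is roughly the kind of check my tool does, in Python. This is a simplified reconstruction rather than the actual tool, and the radical stroke table is abridged and hand-filled by me:

import re

# Nominal stroke counts for a few KangXi radicals, keyed the way
# kRSUnicode writes them (an apostrophe marks the simplified form).
# A real table would cover all 214 radicals and their variants.
RADICAL_STROKES = {"9": 2, "85": 4, "120": 6, "120'": 3}

def looks_suspicious(total, rs_value, tolerance=1):
    """Flag a character whose kTotalStrokes (total) and kRSUnicode
    value (rs_value, e.g. "120.3" or "120'.3") imply a radical
    stroke count far from the nominal one."""
    m = re.match(r"(\d+'*)\.(\d+)$", rs_value)
    if not m:
        return False          # malformed value; not this check's job
    key, residual = m.group(1), int(m.group(2))
    nominal = RADICAL_STROKES.get(key)
    if nominal is None:
        return False          # no reference count to compare against
    implied = total - residual
    # Positional variants (e.g. 水 with 4 strokes vs. 氵 with 3) share
    # one radical number, so allow a small tolerance before flagging.
    return abs(implied - nominal) > tolerance

So looks_suspicious(9, "120.3") is False (9 - 3 = 6 strokes for 糸, as expected), while looks_suspicious(13, "120.3") is True (an implied 10-stroke radical). The real tool uses more stats than this, but the core idea is the same.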
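P.P.S. And here, just as roughly, is the shape of the representation I'm proposing, with made-up variant identifiers (today kRSUnicode distinguishes only the simplified form, via the apostrophe, and not positional variants):

# Hypothetical: each formally recognized radical variant gets its own
# identifier and a fixed stroke count.
RADICAL_VARIANTS = {
    "120":     ("糸", 6),  # traditional form
    "120'":    ("纟", 3),  # simplified form
    "85":      ("水", 4),
    "85-side": ("氵", 3),  # positional variant, not separable today
}

def total_strokes(variant_id, residual):
    # With the variant's stroke count fixed, the total is derivable,
    # so a separate kTotalStrokes field could never disagree with it.
    form, radical_strokes = RADICAL_VARIANTS[variant_id]
    return radical_strokes + residual

For example, total_strokes("120'", 3) gives 6, which is the correct total for a character like U+7EA2 红 (纟 plus 工).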