[Apologies if this issue has already been resolved. I searched the Unicode.org site for discussions, but I found only a document dating from 2003 that touches on the issue: andrewcw...@alumni.princeton.edu RE: Unicode 4.0.1 Beta Review 1. kRSUnicode Field (http://www.unicode.org/L2/L2003/03311-errata4.txt)]

A CJK Han character is conventionally viewed as consisting of a radical plus a residual part or "phonetic". (For a character which is itself a radical, the residual part is empty. The term "phonetic", indicating that the residual part of the character points to the pronunciation of the character, properly applies only to 90-95% of characters, but it applies in the examples below.) The two parts of a character each consist of a specific arrangement of strokes, and together they account for all the strokes in the character. In particular, the number of strokes in the radical portion plus the number of strokes in the residual portion equals the total number of strokes in the character.

The stroke count of a radical combined with a residual part is not always the same as the stroke count of the radical appearing on its own; it may be slightly or significantly lower due to a minor or major abbreviation. (A radical may have several forms which are used in different positions within the whole character, say left or right side vs. top or bottom. These variants may have the same or different stroke counts.) Because of abbreviated variants, the total stroke count for a character cannot always be obtained by adding the stroke count of the radical in its standalone form to the stroke count of the residual portion. However, the stroke count of the radical as it appears in the character can always be obtained by subtracting the stroke count of the residual portion from the total stroke count of the character.

The Unihan database provides the exact data needed to make this calculation:

kTotalStrokes: stroke count for the full character
kRSUnicode: radical number and residual stroke count, in the format <rad_num>['].<res_strokes>, where the optional ' (apostrophe) indicates a widely used abbreviation for the radical with a significantly different appearance and a significantly lower (by 3 or more) stroke count. (But not all such forms are so marked - examples are forms with radical numbers 140, 162, 163, 170.
It may be that the marker is limited to abbreviations used in Simplified as opposed to Traditional Chinese characters.)

The formula is simply:

    radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes

This formula generally gives correct results, but not always. In fact, according to a reasonably accurate heuristic test I ran, it produces incorrect (or at least "suspicious") results for 2236 of the 74911 characters in the database that have both kTotalStrokes and kRSUnicode data, or 3%. Moreover, the rate is significantly higher for the characters in the BMP than for those in the SIP - in fact it is negligible in the latter. Most importantly, it is 8.2% in the block containing all the most widely used characters, the base CJK Unified Ideographs block. The numbers for all the blocks are as follows:

    RANGE      TOTAL*  SUSPICIOUS   PCT
    BMP
      BASE      20941        1727   8.2
      CMP         302          29   9.6
      CMPS          4           0   0.0
      EXTA       6582         469   7.1
    SIP
      EXTB      42711           6   0.0
      EXTC       4149           5   0.1
      EXTD        222           0   0.0
    TOTAL       74911        2236   3.0

    *with both kTotalStrokes and kRSUnicode

Some of the suspicious cases are actually valid, but I believe that the vast majority are truly incorrect, and that the rate of incorrect radical stroke counts implied by kTotalStrokes and kRSUnicode is at least 6-7% for the base CJK Unified Ideographs block.

Here are a couple of examples where the stroke counts are fairly small and the radicals and the residual parts ("phonetics") are widely occurring. The first illustrates the situation where the radical stroke count implied by kTotalStrokes and kRSUnicode is greater than the correct value, and the second the situation where the implied radical stroke count is less than the correct value. (The second situation is much more common than the first, accounting for at least 80% of the "suspicious" items.)

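A minimal Python sketch of the formula above, assuming the first kRSUnicode value has exactly the <rad_num>['].<res_strokes> shape described in this message. The names parse_rs and implied_radical_strokes are made up for illustration, and the two sample calls use the values quoted in the examples that follow.

```python
import re

# Shape of one kRSUnicode value as described in this message:
# <rad_num>['].<res_strokes>
RS_PATTERN = re.compile(r"^(\d+)('?)\.(\d+)$")


def parse_rs(value):
    """Split the first kRSUnicode value into
    (radical number, apostrophe present, residual strokes)."""
    first = value.split()[0]  # the field may list several values; use the first
    match = RS_PATTERN.match(first)
    if match is None:
        raise ValueError(f"unrecognized kRSUnicode value: {value!r}")
    return int(match.group(1)), match.group(2) == "'", int(match.group(3))


def implied_radical_strokes(total_strokes, rs_value):
    """radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes"""
    _, _, res_strokes = parse_rs(rs_value)
    return total_strokes - res_strokes


print(implied_radical_strokes(8, "7.5"))   # U+4E9B: prints 3
print(implied_radical_strokes(10, "9.9"))  # U+5040: prints 1
```
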
Example 1: character U+4E9B 'a few'

    kTotalStrokes = 8
    kRSUnicode = 7.5
      radical number = 7 'two'
      residual strokes = 5
    implied radical stroke count = 3 (8 - 5)
    correct radical stroke count = 2
    diff = 1 (implied count one too high)

The residual portion of the character occurs as an independent character, U+6B64 'this, these'. Its kTotalStrokes is 6 and its kRSUnicode is 77.2. Radical #77 'stop' has 4 strokes in its standalone form, so the residual stroke count of 2 is consistent with a total count of 6. In the main character U+4E9B, therefore, the residual part has effectively lost a stroke in composition, being reduced from 6 to 5. (This actually seems to be the norm with this phonetic. Other examples are U+4F4C, U+5470, U+5472, U+59D5, U+67F4, U+75B5, U+7689, U+7725, and I'm sure there are more.)

Example 2: character U+5040 'distinguished person; English person'

    kTotalStrokes = 10
    kRSUnicode = 9.9
      radical number = 9 'person'
      residual strokes = 9
    implied radical stroke count = 1 (10 - 9)
    correct radical stroke count = 2
    diff = -1 (implied count one too low)

Again the residual portion occurs as an independent character, U+82F1 'distinguished; English'. Its kTotalStrokes is 8 and its kRSUnicode is 140.5. Radical #140 'grass' has 6 strokes in its standalone form, but as the radical component of a larger character it is always abbreviated to a form with 3 strokes. That is the case here. Thus the residual count of 5 in the kRSUnicode of U+82F1 is consistent with the kTotalStrokes of 8 for the character. This count of 8 agrees with the residual count for the full character U+5040 implied by its kTotalStrokes of 10, but is one less than the 9 residual strokes specified in its kRSUnicode.

In both examples the discrepancy between kTotalStrokes and kRSUnicode arises out of different residual stroke counts and has nothing to do with the radical, be it its identity, the variant used, or its stroke count. While there are some exceptions, this is clearly the normal situation. It also makes sense.
Most disagreements on stroke counts have to do with the residual as opposed to the radical portion of the characters. (Questions about radical counts usually involve cases where the radical has more than one form in a given context, for example rad #140 'grass', which has 6 strokes in its full form but variants in the top position with 3 and 4 strokes. Less commonly, they involve cases where the radical is fused with the residual portion or even lost altogether as part of a historical simplification.)

As mentioned above, discrepancies of the type illustrated by the second example (implied radical strokes fewer than correct) are much more common than discrepancies of the type illustrated by the first example (implied radical strokes more than correct). To the extent that the discrepancies involve the residual stroke counts and have nothing to do with the radical, the situation can be reframed in terms of residual stroke counts as:

Dominant pattern: the residual stroke count specified in kRSUnicode is greater than that implied by kTotalStrokes (9 vs. 8 strokes in Ex. 2)
Minor pattern: the residual stroke count specified in kRSUnicode is less than that implied by kTotalStrokes (5 vs. 6 strokes in Ex. 1)

The results of the heuristic test indicate that the great majority of cases of both patterns involve differences in residual stroke counts of one or occasionally two strokes. I believe this is in line with the variations in stroke counting that are observed in actual practice (dictionaries etc.).

Still, the question needs to be asked: do the discrepancies (which occur in 5% of all characters in the base Unicode character set) simply represent different, but more or less equally valid, ways of counting strokes, or are they errors that need to be corrected or at least addressed in some way? In my view the answer depends on a more specific question: are kTotalStrokes and kRSUnicode intended to be consistent?
That is, regardless of what exact count is chosen for a given character, should both terms reflect the same count?

Here is how the two fields are described in the document Proposed Update to Unicode Standard Annex #38, Unicode 6.0.0 draft 1 (http://www.unicode.org/reports/tr38/tr38-8.html):

kTotalStrokes: "The total number of strokes in the character (including the radical). _This value is for the character as drawn in the Unicode charts_."

kRSUnicode: "A standard radical/stroke count for this character in the form "radical.additional strokes". The radical is indicated by a number in the range (1..214) inclusive. An apostrophe (') after the radical indicates a simplified version of the given radical. The "additional strokes" value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical. This field is also used for additional radical-stroke indices where either a character may be reasonably classified under more than one radical, or alternate stroke count algorithms may provide different stroke counts. _The first value is intended to reflect the same radical as the kRSKangXi field and the stroke count of the glyph used to print the character within the Unicode Standard_."

When I talk about kRSUnicode I always mean the first value in the list. Similarly, my heuristic test always uses the first value. I mention this because of the way the last paragraph of the description refers specifically to this value.

Both descriptions tie the specific values of the two fields to the specific glyphs used to draw/print the character in the Unicode charts (kTotalStrokes "character as drawn in the Unicode charts", kRSUnicode "the glyph used to print the character within the Unicode Standard"). Given this, the answer to the question of whether the two fields should be consistent certainly seems to be yes. And this means that the cases where they are not, i.e. where there are discrepancies, are errors.

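One way such a consistency check could be sketched in Python is shown below. This is not the heuristic test mentioned above; it simply compares the implied radical stroke count with the stroke counts of the radical's known forms. The table RADICAL_FORM_STROKES and the function name stroke_diff are hypothetical, and the table holds only the few radicals whose counts are quoted in this message.

```python
# Hypothetical table: stroke counts of the forms each radical can take.
# Only radicals whose counts are quoted in this message are listed; a real
# check would need an entry for all 214 radicals.
RADICAL_FORM_STROKES = {
    7: {2},          # 'two'
    9: {2},          # 'person'
    140: {3, 4, 6},  # 'grass': full form 6, top-position variants 3 and 4
}


def stroke_diff(total_strokes, rs_value):
    """Return implied radical strokes minus the nearest known form count.

    0 means the two fields are consistent for some form of the radical;
    None means the radical is not in the table.
    """
    first = rs_value.split()[0]            # use only the first kRSUnicode value
    rad_part, res_part = first.split(".")
    rad_num = int(rad_part.rstrip("'"))    # drop the optional apostrophe
    implied = total_strokes - int(res_part)
    forms = RADICAL_FORM_STROKES.get(rad_num)
    if forms is None:
        return None
    # sorted() makes the choice deterministic when two forms are equally near
    nearest = min(sorted(forms), key=lambda n: abs(implied - n))
    return implied - nearest


samples = [("U+4E9B", 8, "7.5"), ("U+5040", 10, "9.9"), ("U+82F1", 8, "140.5")]
for code_point, total, rs in samples:
    print(code_point, stroke_diff(total, rs))
```

With the values quoted in the examples, this reports a difference of 1 for U+4E9B, -1 for U+5040 and 0 for U+82F1, matching the diffs given there.
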
If it's conceded that the discrepancies do reflect errors, then I think it also needs to be conceded that they need to be addressed in some way. The most straightforward thing would be to go through all the cases and change either kTotalStrokes or kRSUnicode to (a) be consistent and (b) offer values appropriate to the specific glyph used in the standard.

Given that kRSUnicode is used for ordering characters in the block (the radical number being used to determine which radical a character is listed after, and the residual count being used to determine where after the radical it appears - except for ties, which are ordered arbitrarily), while to the best of my knowledge kTotalStrokes is not used for anything within the standard, the most practical thing would be to keep the existing kRSUnicode value wherever it is not obviously incorrect and adjust kTotalStrokes to be consistent with it. But this involves changing a lot of data - including data for the most widely used characters, those in the base CJK Unified Ideographs block - and may break systems that use the existing values.

An alternative which I would suggest is to create a new field, which could be called kRSUnicode2 or something similar, and which would have not two but three subfields (not counting the apostrophe):

    <rad_num>['].<rad_strokes>.<res_strokes>

where the first and third subfields are the same (same meaning, same values, barring clear errors) as in kRSUnicode and the added second subfield is the number of strokes in the radical as it appears in the character. This new field would contain all the stroke count information that's needed for a character, including not only the residual strokes but also the radical strokes and, via calculation (adding the two values), the total strokes. The last can be compared with kTotalStrokes, but does not depend on it, and may be different.

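A rough Python sketch of how a value of the suggested field might be read and written, assuming the <rad_num>['].<rad_strokes>.<res_strokes> layout above. kRSUnicode2 does not exist in Unihan, and RS2, parse_rs2 and format_rs2 are names invented here purely for illustration.

```python
from typing import NamedTuple


class RS2(NamedTuple):
    """One value of the proposed (hypothetical) kRSUnicode2 field."""
    rad_num: int      # radical number, same as in kRSUnicode
    marked: bool      # True if the apostrophe is present
    rad_strokes: int  # strokes in the radical as it appears in the character
    res_strokes: int  # residual strokes, same as in kRSUnicode

    @property
    def total_strokes(self):
        # The total is derived, so it cannot disagree with the two parts.
        return self.rad_strokes + self.res_strokes


def parse_rs2(value):
    """Parse '<rad_num>['].<rad_strokes>.<res_strokes>' into an RS2."""
    head, rad_strokes, res_strokes = value.split(".")
    return RS2(int(head.rstrip("'")), head.endswith("'"),
               int(rad_strokes), int(res_strokes))


def format_rs2(rs2):
    """Inverse of parse_rs2."""
    mark = "'" if rs2.marked else ""
    return f"{rs2.rad_num}{mark}.{rs2.rad_strokes}.{rs2.res_strokes}"


# U+82F1: kRSUnicode 140.5, radical drawn with 3 strokes
print(parse_rs2("140.3.5").total_strokes)  # prints 8, same as its kTotalStrokes
# U+4E9B: kRSUnicode 7.5, radical drawn with 2 strokes
print(parse_rs2("7.2.5").total_strokes)    # prints 7, one less than its kTotalStrokes
```
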
(Note that the presence of the apostrophe would become largely predictable from a comparison of the radical stroke count in the second subfield with the count for the radical as a standalone character. In fact it would only be necessary to retain it if its purpose was not simply to indicate significantly abbreviated radicals in general but specifically to indicate forms that are used in Simplified Chinese characters but not in the corresponding Traditional ones.)

I see the following advantages to this approach:

(1) No constraints are placed on existing kTotalStrokes or kRSUnicode values - they can be left as is or changed at any point without implications for the new kRSUnicode2 values
(2) No systems that use the existing kTotalStrokes or kRSUnicode fields will break or be affected in any way (though they could be changed to use the self-standing kRSUnicode2 field with possibly more satisfactory results)
(3) All stroke information for a character is contained in a single field, kRSUnicode2, and can't be inconsistent (though it can be wrong)
(4) Stroke counting differences between fields can be directly found and quantified (particularly, by comparing the partial stroke information in kTotalStrokes and/or kRSUnicode to the full information in kRSUnicode2)
(5) An initial version of the full set of the new kRSUnicode2 field values could be generated algorithmically from kTotalStrokes and kRSUnicode and then revised by human inspection focusing on the proportionally small number (8% in the base block, 3% overall) of "suspicious" cases detected by a heuristic procedure (which I'm sure could be made more accurate than the one I used, for example by bringing in more existing information sources)

The main disadvantages I see are:

(1) Confusion arising from the overlap between the old and new fields
(2) The work involved (though anything other than dismissing or postponing the issue is going to involve work)

If there is interest I will be glad to share the results of my heuristic test and the program (Python) I used to produce them.

John Armstrong
Cambridge MA