Actually, there are a number of loose ends still, as it appears that some of Rob Mount's questions were not actually answered.
> I understand what you say about word formation, and combining
> marks, and that the Alphabetic classification should not be
> limited to "L"s. But 30FC is of General Category "Lm" (which
> should be included) and, since version 3.1, is classified
> explicitly as Alphabetic in DerivedCoreProperties.txt. (It appears
> that formal expression of the Alphabetic property was moved from
> PropList.txt to DerivedCoreProperties.txt in 3.1.) I don't
> understand why its exclusion from the Alphabetic category in 3.0.1
> was not an oversight. But if not, then either the consortium
> consensus on the classification of this character has changed, or
> the current classification is in error.

Here's some more background for people. I realize that all the
version information is getting bewilderingly complex, so not
everyone is going to want to research back through all the
versions, particularly when that would mean also trying to dig back
through the UTC decision trail.

From Unicode Version 2.0 to Unicode Version 3.0.1 I maintained the
PropList.txt file. During that time, it was explicitly an
*informative* file only, and was included in the UNIDATA directory
on that basis, as potentially helpful information, only.

The change to Unicode Version 3.1.0 was a major watershed. Mark
Davis started maintaining the PropList.txt file (and a number of
other files) with a different set of tools that specified a large
number of properties as derived, via rule, from other properties --
hence the introduction of the DerivedXXX files.

At this point, the UTC reexamined all of the character properties
and changed the status of some of them. Some of the former
properties from PropList.txt were made normative (and their content
adjusted slightly), some were left informative, some were equated
to derived properties (hence moved to other files), and some were
determined to be uninteresting, and thus were dropped altogether.
The format of PropList.txt also changed completely at this point.

Now as regards the particular handling of U+30FC, the treatment in
PropList.txt from Unicode 2.0 to Unicode 3.0.1 was consistent:

  General Category = Lm
  PropList specification:
    [-Alphabetic] [+Diacritic] [+Extender] [+Identifier_Part]

The theory behind that was that while U+30FC was Lm, like many
other diacritic letter modifiers it wasn't formally part of an
alphabetic or syllabic set of symbols per se, so wouldn't be given
the Alphabetic property. However, other implicit derivations for
word boundaries or identifier boundaries should include the
[+Extender] characters to get the expected results. Hence the
determination, for example, that U+30FC was [+Identifier_Part].

Starting with Unicode 3.1.0 and continuing through to Unicode
4.0.0, the treatment is still consistent, although slightly
different:

  General Category = Lm
  PropList specification:
    [-Other_Alphabetic] [+Diacritic] [+Extender]
  DerivedCoreProperties:
    [+Alphabetic] [+ID_Continue]

The General Category, the status as diacritic and extender, and the
derived status as part of identifiers are unchanged. What has
changed, however, is the interpretation of what "Alphabetic", as a
derived property now, means. As Mark pointed out, it is now derived
as:

  # Derived Property: Alphabetic
  #  Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

By *this* definition, all the Lm characters from Unicode 2.0 on
would *also* have been "Alphabetic". And Other_Alphabetic was
consistently developed by subtracting out all the Lu, Ll, Lt, Lm,
Lo, and Nl characters from the preexisting Alphabetic definition
from PropList.txt.

So the correct answer is not that the consensus about the behavior
and properties of U+30FC has changed, but rather that the
inclusiveness of the "Alphabetic" property changed a little when it
was redefined to be a derived property.

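To make the derivation rule concrete, here is a minimal sketch in
Python of "Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic". It assumes the
standard unicodedata module, which reflects whatever Unicode version
the interpreter bundles rather than 3.1.0. The names
is_alphabetic_derived and OTHER_ALPHABETIC are hypothetical, and the
Other_Alphabetic set is left empty as a placeholder; a real
implementation would load it from PropList.txt.

```python
import unicodedata

# General categories that feed the derived Alphabetic property,
# per the "Generated from" comment in DerivedCoreProperties.txt.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical placeholder for the Other_Alphabetic set from
# PropList.txt. It is empty here; this is an assumption of the
# sketch, not the real data.
OTHER_ALPHABETIC = frozenset()


def is_alphabetic_derived(ch, other_alphabetic=OTHER_ALPHABETIC):
    """Approximate the derived Alphabetic property for one character."""
    return (unicodedata.category(ch) in ALPHABETIC_CATEGORIES
            or ch in other_alphabetic)


# U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
PROLONGED_SOUND_MARK = "\u30fc"

if __name__ == "__main__":
    print(unicodedata.category(PROLONGED_SOUND_MARK))
    print(is_alphabetic_derived(PROLONGED_SOUND_MARK))
```

Because U+30FC has General Category Lm, it satisfies the rule without
needing to be listed in Other_Alphabetic, which matches the
[-Other_Alphabetic] [+Alphabetic] combination shown above.
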
Note that for the property more relevant to determination of things
like identifiers (now known as ID_Continue), there has been *no*
change to the behavior of U+30FC since Unicode 2.0.

> Here's a little more background regarding my motivation. The
> problem occurs in a procedure that evaluates whether a
> user-supplied name can be used as an identifier - for which
> identification of alphabetic characters is important.

Actually, as you can see from the above discussion, and from the
discussion of identifiers you mentioned in the standard, it is
ID_Start and ID_Continue which are more relevant than "Alphabetic"
per se.

> One implementation of isalpha(), purportedly based on Unicode 2.1,
> indicates that 30FC is an alpha character. The current
> implementation from the same vendor, based on 3.0.1, classifies it
> as non-alpha. Presumably the next one will be based on 3.1 or
> later and will reclassify it, again, as alpha.

The vendor has done something based on its own interpretation of
the informative data files, then. The status of U+30FC did not
change between Unicode 2.1 and Unicode 3.0.1 in the informative
PropList.txt data file -- so whatever they did was on their own
hook.

> If we can't depend on uniform behavior of isalpha() we will have
> to eliminate its use from our validation function.

I'd advise you to check the reference that Mark supplied regarding
the use of POSIX functions in the context of Unicode character
properties in the Proposed Update to UTS #18:

http://www.unicode.org/reports/tr18/tr18-7.html

See, particularly, Annex C: Compatibility Properties. There has
been a lot of confusion about what isalpha() could mean in the
context of a Universal Character Set, and POSIX provided little
guidance for how to make the extensions. Note that Java and Perl
are handling this differently than people who follow the
recommendations of ISO TR 10176 (which excludes combining marks
based on its own theory of what should be included in identifiers).

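On the point that identifier validation should rest on
identifier-specific properties rather than on an isalpha()-style
test, the contrast can be sketched in Python. This is only an
illustration under stated assumptions: str.isidentifier() applies
Python's own identifier rule (built on the closely related XID_Start
and XID_Continue properties, plus the underscore), not ID_Start and
ID_Continue themselves, and the function name is_valid_identifier is
hypothetical.

```python
def is_valid_identifier(name):
    """Check a user-supplied name against Python's identifier rule.

    Assumption: Python's rule stands in for an ID_Start/ID_Continue
    check; it is based on XID_Start/XID_Continue and also accepts
    the underscore.
    """
    return name.isidentifier()


if __name__ == "__main__":
    # KATAKANA LETTER KA followed by U+30FC, the prolonged sound mark.
    for candidate in ("\u30ab\u30fc", "name_1", "1name"):
        print(repr(candidate),
              "isalpha:", candidate.isalpha(),
              "identifier:", is_valid_identifier(candidate))
```

A name such as "name_1" fails an all-alphabetic test yet is a
perfectly good identifier, which is why the alphabetic test is the
wrong tool for this job regardless of how U+30FC is classified.
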
>
> So I am trying to discover why the behavior of isalpha() has
> changed. Here are the possibilities:
> 1) the previous implementation was incorrect and the current one
>    is fixed;
> 2) the current implementation is flawed because it does not
>    conform to the documented standard;
> 3) the current implementation is flawed because it's based on
>    incorrect documentation of the standard;
> 4) both implementations are correct but are based on different,
>    incompatible standards;
> 5) something else I don't yet understand.

5. None of the above.

5a. The property was informative in the first place, so a claim of
    conformance prior to the mechanisms put in place in Unicode
    3.1.0 was a little out-of-place, anyway.

5b. The Alphabetic property for U+30FC did not change between
    Unicode 2.1 and Unicode 3.0.1, so why your vendor changed it is
    based on some extraneous factor, and not based on some change
    in PropList.txt or a change in its documentation.

5c. What changed beginning with Unicode 3.1.0 was the scope of the
    Alphabetic property itself (based on its switch to being a
    derived property), rather than any implication for how the
    particular character U+30FC should behave in implementations.

>
> The overriding assumption for this entire discussion is that the
> behavior of isalpha() should be governed by the Unicode Alphabetic
> property. That seems reasonable to me and is, in fact, the
> vendor's claim.

This is, in fact, what the UTC is now formally recommending, in the
Proposed Update for UTS #18. It is not, however, what every vendor
does for an isalpha() implementation in detail.

> If not, (or even if so) perhaps someone can recommend a better (or
> more stable) API for discovery of Unicode character metrics upon
> which we might base our identifier validation and other character
> processing logic.

Unless you are specifically depending on Windows platform APIs to
make such determinations, I would suggest the ICU implementation of
character properties as likely to be the most accurate and
up-to-date in a generally available cross-platform library.

--Ken