Ah, I see why you didn't find the Alphabetic property. It was added in Unicode 3.1.0 (March 2001), precisely to capture characters that are not L yet are still alphabetic. If you look at the derivation in C:\DATA\UCD\3.1.0-Update\DerivedCoreProperties-3.1.0.txt, you will see:
# Derived Property: Alphabetic
#  Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

So Alphabetic includes all L's, but also other characters. And, as I
said, it alone is not sufficient for word breaks.

> Is the omission of 30FC from the Alphabetic category of PropList.txt an
> error?

This is not an oversight. As I said, many characters are not
Alphabetic and are still part of words. Examples include that
character and many others. As a simple case, "can't" is a word in
English, although the apostrophe is not alphabetic. There are many,
many examples using combining marks, such as a virama (halant) in
Hindi, which is not Alphabetic:

http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=094D

So if you want reasonable word breaks, you need to use more than the L
category; you need to look at

> http://www.unicode.org/reports/tr14/
> http://www.unicode.org/reports/tr29/

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Mount, Rob (Robert F)" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, June 05, 2003 11:57
Subject: RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

> Thanks to all who responded. The insight you provided is invaluable. And I
> appreciate your patience with a UNICODE beginner.
>
> Mark's reference to UCD.html, and by inference to DerivedCoreProperties.txt,
> seems definitive. However, these are part of the 4.0 spec. The suspect
> implementation of isalpha is based, according to the vendor, on 3.0.1.
>
> The vendor relies, instead, on
> http://www.unicode.org/Public/3.0-Update1/PropList-3.0.1.txt
> which classifies 30FC as Diacritic, Extender, Bidi: Left-to-Right, and
> Identifier Part, but not as Alphabetic. Is this an error in the
> specification? I could find no reference to the Alphabetic property in the
> 3.0.1 documentation except in, and with reference to, PropList-3.0.1.txt.
> However, it would seem, from the 4.0 documentation, that all characters
> having a General Category beginning with "L" should be considered as
> letters, and hence, implicitly, as Alphabetic.
>
> Is this, indeed, the intent of the General Category classifications
> beginning with "L"?
>
> Is the omission of 30FC from the Alphabetic category of PropList.txt an
> error?
>
> Rob
>
> -----Original Message-----
> From: Mark Davis [mailto:[EMAIL PROTECTED]
> Sent: Thursday, June 05, 2003 9:28 AM
> To: Mount, Rob (Robert F); [EMAIL PROTECTED]
> Subject: Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
>
> The UCD has a property explicitly called "Alphabetic". So that should
> be used when determining whether a character is, well, alphabetic. See
> http://www.unicode.org/Public/UNIDATA/UCD.html
>
> However, in the past many people have misused functions like isAlpha()
> for doing more complicated processing like determining text boundaries
> (line and word breaks, for example). The function isAlpha() does not
> discriminate finely enough to be very accurate for processing like
> that. For more information, see
> http://www.unicode.org/reports/tr14/
> http://www.unicode.org/reports/tr29/
>
> Also see the proposed update to Unicode Regular Expressions, for
> discussion of the use of Unicode properties in connection with alpha,
> punct, etc. (in the context of regular expressions, at least).
> http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties
>
> Mark
> __________________________________
> http://www.macchiato.com
> ► “Eppur si muove” ◄
>
> ----- Original Message -----
> From: "Mount, Rob (Robert F)" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, June 04, 2003 16:11
> Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
>
> > All,
> >
> > I am investigating differing behavior in various environments of the
> > wide-character version of the C function isAlpha with respect to
> > character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
> > implementations indicate that it is alphabetic, some don't. I
> > suspect that other characters might be subject to the same confusion.
> >
> > The UNICODE documents seem ambiguous on this point: the General
> > Category is "Lm" which, although informative instead of normative,
> > would seem to imply that it is alphabetic; likewise
> > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but
> > PropList-4.0.0.txt contains two records - one indicating that it is
> > a diacritic, one that indicates it is an extender.
> >
> > On to my questions:
> >
> > Q1: Can a character be both alphabetic and diacritic?
> >
> > Q2: Is there a definitive answer as to whether this is an alphabetic
> > character?
> >
> > Thanks in advance for answers to these questions and/or any
> > additional insight you can provide.
> >
> > Regards,
> > Rob Mount
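
To make the derivation discussed in this thread concrete, here is a minimal sketch using Python's standard unicodedata module. It is an illustration only, not anything posted in the thread: the names ALPHABETIC_BASE and in_alphabetic_base are hypothetical, and Python ships a newer version of the UCD than the 3.x and 4.0 files under discussion. The standard library does not expose Other_Alphabetic, so the helper covers only the Lu+Ll+Lt+Lm+Lo+Nl part of the formula and will miss some Alphabetic characters.

```python
import unicodedata

# Hypothetical name: the General Category part of the Alphabetic derivation.
# Other_Alphabetic is NOT included, because unicodedata does not expose it.
ALPHABETIC_BASE = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def in_alphabetic_base(ch: str) -> bool:
    """Return True if ch falls in the General Category part of Alphabetic."""
    return unicodedata.category(ch) in ALPHABETIC_BASE


# The three characters mentioned in the thread: the prolonged sound mark,
# the Devanagari virama, and the apostrophe in "can't".
for ch in ("\u30fc", "\u094d", "'"):
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"category={unicodedata.category(ch)}, "
        f"in_alphabetic_base={in_alphabetic_base(ch)}"
    )
```

U+30FC has General Category Lm, so it passes the check, while the virama (Mn) and the apostrophe (Po) do not, even though both occur inside words. That is why a word-break routine needs the rules in TR14 and TR29 rather than an isAlpha-style test alone.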