1. Here is my take, if you are trying to slim down:

> canonical combining class

This is really useful for matching. For example, if my source text is NFD and I want to recognize whatever is canonically equivalent to a-ring (with perhaps other accents), then I have to use something like the following (syntax may vary, and I throw in some variables for clarity):

    $nonAboveAccent = [\p{ccc!=0} & \p{ccc!=above}] ;
    $ringAbove = \u030A ;
    $pattern = [aA] $nonAboveAccent* $ringAbove ;

Note that it is more complicated to do the same thing in NFC; it looks something like:

    $aWithSomeRing = [aA\u01FA\u01FB] ;
    $aWithWithNonAboves = [ÀÁÂÃÄàáâãäĀā....] ; // many more
    $pattern = ($aWithSomeRing | $aWithWithNonAboves $nonAboveAccent* $ringAbove ) ;

> bidi class ?
> east asian width
> arabic joining type ?
> arabic joining group ?
> line break

These are all similar, and have to do with eventual appearance on the screen. If you are trying to match expressions based on one of these eventual display features, then they are useful; otherwise, they are not particularly useful.

> all binary properties

See http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt

Of these, the Other_X properties are only contributory, and can be omitted. The Expands_On_X properties are not, in my opinion, particularly useful, and are mainly included for historical reasons.

You may want to look at the ICU property support for 2.2 (in beta) just for comparison. See
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/ucd_icu.html
The items marked "U" are included in UnicodeSet, which corresponds to [...] in regular-expression engines.

2. I'll give a more concrete example:

    [\p{script=latin}\g{aa} & \p{lowercase}]

The way we interpret this is that [\p{lowercase}] is a set of code points, as you do. I wouldn't try anything fancier. When you AND it with a set of code points and strings, you end up with just code points. That is well defined; you just need to caution the user that it will exclude strings, such as "aa".
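To make the intersection rule concrete, here is a minimal Python sketch; it is not ICU code. The helper names (is_lowercase, intersect_with_property) and the sample set are hypothetical, and \p{lowercase} is only approximated by general category Ll, which is an assumption made to keep the example short.

```python
import unicodedata


def is_lowercase(cp: str) -> bool:
    # Assumption: approximate \p{lowercase} by general category Ll.
    # The real Lowercase property also covers Other_Lowercase characters.
    return unicodedata.category(cp) == "Ll"


def intersect_with_property(members, has_property):
    # A property set contains only single code points, so any member that
    # is a multi-code-point string (such as "aa") can never be in the result.
    return {m for m in members if len(m) == 1 and has_property(m)}


# Hypothetical stand-in for [\p{script=latin}\g{aa}]:
# a few Latin code points plus the string "aa".
latin_and_aa = {"a", "b", "A", "\u00e9", "aa"}

result = intersect_with_property(latin_and_aa, is_lowercase)
print(sorted(result))
```

The string "aa" drops out even though each of its code points is lowercase, which is exactly the behaviour the user needs to be cautioned about.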
If a user wanted to do a broader match, s/he would write something like:

    $lowercaseLetterString = (\p{lowercase} \p{gc=non-spacing mark}*)* ;
    [\p{script=latin}\g{aa}] & $lowercaseLetterString

Mark
__________
http://www.macchiato.com
◄ “Eppur si muove” ►

----- Original Message -----
From: "Theo Veenker" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, July 23, 2002 00:18
Subject: \p{} and \g{} in regexp

> Hi,
>
> I have a few questions regarding unicode regular expressions.
>
> 1) I'm working on a regexp matcher and I'd like to know which properties
> are never needed in a \p{...} item. Currently I have included the properties
> listed below, but for efficiency reasons I'd like to throw out what isn't
> really necessary:
>
> general category
> bidi class ?
> canonical combining class ?
> decomposition type
> line break
> east asian width
> arabic joining type ?
> arabic joining group ?
> script name
> block name
> age
> numeric type
> all binary properties
>
> So can anyone tell me if the marked properties are really useful in
> a \p{...} item?
>
>
> 2) About grapheme clusters in a bracketed expression. It is clear what is
> meant by an expression like [a-z\g{aa}]. But how do I interpret something
> like [a-z\g{aa} & \p{foo}]? This reads as: accept any character in range
> a-z or grapheme cluster aa, provided it has the foo property. The problem
> is that \p{...} only applies to single code points, not to grapheme clusters.
>
> I can do three things:
> 1. try if NFC of characters in \g{...} yields a single character and
>    work with that, otherwise fail
> 2. only test first (base) character of the cluster
> 3. don't allow use of operators & and - (i.e. &^) in a bracketed
>    expression in which one or more \g{...} are used
>
> What would be the most appropriate thing to do?
>
> Regards,
> Theo