Hi, I have a few questions regarding Unicode regular expressions.

1) I'm working on a regexp matcher and I'd like to know which properties are never needed in a \p{...} item. Currently I have included the properties listed below, but for efficiency reasons I'd like to throw out what isn't really necessary:

   general category
   bidi class ?
   canonical combining class ?
   decomposition type
   line break
   east asian width
   arabic joining type ?
   arabic joining group ?
   script name
   block name
   age
   numeric type
   all binary properties

So can anyone tell me whether the properties marked with "?" are really useful in a \p{...} item?

2) About grapheme clusters in a bracketed expression: it is clear what is meant by an expression like [a-z\g{aa}]. But how do I interpret something like [a-z\g{aa} & \p{foo}]? This reads as: accept any character in the range a-z, or the grapheme cluster aa, provided it has the foo property. The problem is that \p{...} only applies to single code points, not to grapheme clusters. I can do one of three things:

   1. check whether the NFC form of the characters in \g{...} is a single character and work with that, otherwise fail
   2. only test the first (base) character of the cluster
   3. disallow the operators & and - (i.e. &^) in any bracketed expression that contains one or more \g{...} items

What would be the most appropriate thing to do?

Regards,
Theo
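
P.S. For illustration only, options 1 and 2 from question 2 could look roughly like this in Python. This is a sketch, not the actual matcher: the function names are made up for the example, and the general category (read through the standard unicodedata module) merely stands in for an arbitrary \p{foo} test.

```python
import unicodedata
from typing import Optional


def nfc_single_code_point(cluster: str) -> Optional[str]:
    # Option 1: normalize the cluster to NFC and accept it only if the
    # result is exactly one code point; otherwise signal failure with None.
    composed = unicodedata.normalize("NFC", cluster)
    return composed if len(composed) == 1 else None


def base_character(cluster: str) -> str:
    # Option 2: look only at the first (base) code point of the cluster.
    return cluster[0]


def has_category(code_point: str, category: str) -> bool:
    # Hypothetical stand-in for a \p{foo} test: compare the general
    # category of a single code point, e.g. "Ll" for lowercase letters.
    return unicodedata.category(code_point) == category


a_ring = "a\u030a"    # a + COMBINING RING ABOVE, has a precomposed form
q_tilde = "q\u0303"   # q + COMBINING TILDE, has no precomposed form

for cluster in (a_ring, q_tilde):
    composed = nfc_single_code_point(cluster)
    if composed is None:
        # Option 1 fails here; option 2 would still test the base character.
        print("no single code point; base is Ll:",
              has_category(base_character(cluster), "Ll"))
    else:
        print("single code point; is Ll:", has_category(composed, "Ll"))
```

The sketch shows where the two options diverge: option 1 rejects any cluster without a precomposed form, while option 2 always gives an answer but ignores the combining marks entirely.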