Hi,

I have a few questions regarding Unicode regular expressions.

1)  I'm working on a regexp matcher and I'd like to know which properties 
are never needed in a \p{...} item. Currently I have included the properties
listed below, but for efficiency reasons I'd like to throw out what isn't 
really necessary:

  general category
  bidi class                    ?
  canonical combining class     ?
  decomposition type
  line break
  east asian width
  arabic joining type           ?
  arabic joining group          ?
  script name
  block name
  age
  numeric type
  all binary properties

So can anyone tell me if the properties marked with a ? are really useful in 
a \p{...} item?
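
To show what two of the marked properties look like on concrete characters,
here is a small sketch. It uses Python's standard unicodedata module purely
as a stand-in for my own tables, and the function name marked_properties is
made up for this mail. As far as I know that module does not expose the
Arabic joining type or joining group, so those two are left out.

```python
import unicodedata

def marked_properties(ch):
    """Look up two of the questioned properties for one code point.

    Illustration only: unicodedata stands in for the matcher's own tables.
    """
    return {
        # bidi class, e.g. "L", "R", "NSM"
        "bidi_class": unicodedata.bidirectional(ch),
        # canonical combining class, 0 for non-combining characters
        "combining_class": unicodedata.combining(ch),
    }

print(marked_properties("a"))        # Latin small letter a
print(marked_properties("\u05D0"))   # Hebrew letter alef
print(marked_properties("\u0301"))   # combining acute accent
```

A \p{...} item on either property would boil down to comparing one of these
values against the requested one.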


2)  About grapheme clusters in a bracketed expression. It is clear what is
meant by an expression like [a-z\g{aa}]. But how do I interpret something
like [a-z\g{aa} & \p{foo}]? This reads as: accept any character in the range
a-z or the grapheme cluster aa, provided it has the foo property. The problem
is that \p{...} only applies to single code points, not to grapheme clusters.

I can do three things:
  1. check whether the NFC of the characters in \g{...} yields a single
     character and work with that, otherwise fail
  2. only test the first (base) character of the cluster
  3. don't allow the operators & and - (i.e. &^) in a bracketed 
     expression in which one or more \g{...} are used

What would be the most appropriate thing to do?
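
To make options 1 and 2 concrete, here is a rough Python sketch. It is not
my matcher's code: the helper names are invented for this mail, Python's
unicodedata module stands in for my own normalization tables, and a general
category test stands in for an arbitrary \p{foo}.

```python
import unicodedata

def cluster_as_single_code_point(cluster):
    """Option 1: return the NFC form if it is one code point, else None."""
    composed = unicodedata.normalize("NFC", cluster)
    return composed if len(composed) == 1 else None

def base_code_point(cluster):
    """Option 2: look only at the first (base) code point of the cluster."""
    return cluster[0]

def has_general_category(ch, prefix):
    """Stand-in for \\p{...}: test the general category of one code point."""
    return unicodedata.category(ch).startswith(prefix)

# "a" + combining ring above composes to a single code point under NFC,
# so option 1 can apply the property test to the composed character.
composed = cluster_as_single_code_point("a\u030A")
print(composed is not None and has_general_category(composed, "L"))

# The cluster "aa" from the example above stays two code points,
# so option 1 would have to fail here.
print(cluster_as_single_code_point("aa"))

# Option 2 always has something to test, but it ignores the rest of the cluster.
print(has_general_category(base_code_point("aa"), "L"))
```

Option 1 rejects exactly the clusters that have no precomposed form, while
option 2 never fails but can give a different answer than testing the
composed character would.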

Regards,
Theo
