Re: \p{} and \g{} in regexp

Mark Davis Tue, 23 Jul 2002 11:01:27 -0700

1. Here is my take, if you are trying to slim down:

>   canonical combining class


This is really useful for matching. For example, if my source text is
NFD and I want to recognize whatever is canonically equivalent to
a-ring (with perhaps other accents), then I have to use something like
the following (syntax may vary, and I throw in some variables for
clarity):

$nonAboveAccent = [\p{ccc!=0} & \p{ccc!=above}] ;
$ringAbove = \u030A ;

$pattern = [aA] $nonAboveAccent* $ringAbove ;
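
As a rough illustration only, the NFD pattern above can be sketched in Python. The standard `re` module has no \p{ccc=...} support, so this sketch scans by hand and uses `unicodedata.combining()` to read the canonical combining class. It reads $nonAboveAccent as "combining class neither 0 nor Above (230)", which is an interpretation of the pseudo-syntax. The names `find_a_ring` and `is_non_above_accent` are made up for this example.

```python
import unicodedata

RING_ABOVE = "\u030A"   # COMBINING RING ABOVE
ABOVE = 230             # canonical combining class "Above"

def is_non_above_accent(ch):
    """True for a combining mark whose class is neither 0 nor Above."""
    ccc = unicodedata.combining(ch)
    return ccc != 0 and ccc != ABOVE

def find_a_ring(text):
    """Return (start, end) spans matching [aA] nonAboveAccent* ringAbove.

    Assumes `text` is already in NFD.
    """
    spans = []
    i, n = 0, len(text)
    while i < n:
        if text[i] in "aA":
            j = i + 1
            # Skip marks that cannot block the ring from reaching the base.
            while j < n and is_non_above_accent(text[j]):
                j += 1
            if j < n and text[j] == RING_ABOVE:
                spans.append((i, j + 1))
                i = j + 1
                continue
        i += 1
    return spans

# Usage: normalize to NFD first, then scan.
sample = unicodedata.normalize("NFD", "\u00E5ngstr\u00F6m")
print(find_a_ring(sample))
```

An acute accent (class 230) between the base and the ring stops the match, because that sequence is not canonically equivalent to a-ring plus other accents.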

Note that it is more complicated to do the same thing in NFC; it looks
something like:

$aWithSomeRing = [\u00C5\u00E5\u01FA\u01FB] ; // precomposed a-ring forms
$aWithNonAboves = [ÀÁÂÃÄàáâãäĀā....] ; // many more

$pattern = ($aWithSomeRing | $aWithNonAboves $nonAboveAccent* $ringAbove ) ;

>   bidi class ?
>   east asian width
>   arabic joining type ?
>   arabic joining group ?
>   line break

These are all similar, and have to do with eventual appearance on the
screen. If you are trying to match expressions based on one of these
eventual display features, then they are useful; otherwise, they are
not particularly useful.

>   all binary properties
See http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt

Of these, the Other_X are only contributory, and can be omitted.
The Expands_On_X are not, in my opinion, particularly useful, and
mainly included for historical reasons.

You may want to look at the ICU property support for 2.2 (in beta)
just for comparison. See
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/ucd_icu.html
The items marked "U" are included in UnicodeSet, which corresponds to
[...] in regular-expression engines.


2. I'll give a more concrete example:

[\p{script=latin}\g{aa} & \p{lowercase}]

We interpret [\p{lowercase}] as a set of code points, as you do. I
wouldn't try anything fancier. When you AND it with a set of code
points and strings, you end up with just code points. That is well
defined; you just need to caution the user that it will exclude
strings, such as "aa". If a user wanted to do a broader match, s/he
would write something like:

$lowercaseLetterString = (\p{lowercase} \p{gc=non-spacing mark}*)* ;

[\p{script=latin}\g{aa}] & $lowercaseLetterString
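
To make the two readings concrete, here is a minimal Python sketch. It models a bracketed set as a Python set whose members are either single code points or multi-code-point strings. Python's `unicodedata` has no script property, so a small hand-written set stands in for [\p{script=latin}\g{aa}], and general category "Ll" stands in for \p{lowercase} (the real Lowercase property is somewhat broader). All function names here are hypothetical.

```python
import unicodedata

def is_lowercase(ch):
    # Stand-in for \p{lowercase}: general category Ll only.
    return unicodedata.category(ch) == "Ll"

def intersect_with_property(members, has_property):
    """AND a set of code points and strings with a code-point property.

    Only single code points can have the property, so strings drop out.
    """
    return {m for m in members if len(m) == 1 and has_property(m)}

def is_lowercase_letter_string(s):
    """Match (lowercase nonspacing_mark*)* against the whole string."""
    i, n = 0, len(s)
    while i < n:
        if not is_lowercase(s[i]):
            return False
        i += 1
        while i < n and unicodedata.category(s[i]) == "Mn":
            i += 1
    return True

# Stand-in for [\p{script=latin}\g{aa}]: a few Latin letters plus "aa".
members = set("abcXYZ") | {"aa"}

narrow = intersect_with_property(members, is_lowercase)
broad = {m for m in members if is_lowercase_letter_string(m)}

print(sorted(narrow))  # code points only
print(sorted(broad))   # code points and the string "aa"
```

The narrow result silently loses "aa", which is the behavior the user needs to be cautioned about; the broad result keeps it because the filter is defined over strings rather than single code points.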

Mark
__________
http://www.macchiato.com
◄  “Eppur si muove” ►

----- Original Message -----
From: "Theo Veenker" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, July 23, 2002 00:18
Subject: \p{} and \g{} in regexp


> Hi,
>
> I have a few questions regarding unicode regular expressions.
>
> 1)  I'm working on a regexp matcher and I'd like to know which properties
> are never needed in a \p{...} item. Currently I have included the properties
> listed below, but for efficiency reasons I'd like to throw out what isn't
> really necessary:
>
>   general category
>   bidi class ?
>   canonical combining class ?
>   decomposition type
>   line break
>   east asian width
>   arabic joining type ?
>   arabic joining group ?
>   script name
>   block name
>   age
>   numeric type
>   all binary properties
>
> So can anyone tell me if the marked properties are really useful in
> a \p{...} item?
>
>
> 2)  About grapheme clusters in a bracketed expression. It is clear what is
> meant by an expression like [a-z\g{aa}]. But how do I interpret something
> like [a-z\g{aa} & \p{foo}]? This reads as: accept any character in range
> a-z or grapheme cluster aa, provided it has the foo property. The problem
> is that \p{...} only applies to single code points, not to grapheme clusters.
>
> I can do three things:
>   1. try if NFC of characters in \g{...} yields a single character and
>      work with that, otherwise fail
>   2. only test first (base) character of the cluster
>   3. don't allow use of operators & and - (i.e. &^) in a bracketed
>      expression in which one or more \g{...} are used
>
> What would be the most appropriate thing to do?
>
> Regards,
> Theo
>
>
