For people who don't care about multilingual support in their grammars, a
token type like this might be sufficient:
IDENTIFIER                   = <<[A-Za-z][A-Za-z0-9_]*>>

I wanted to expand this to support more than just the basic Latin alphabet,
hoping it would be as easy as rewriting the expression with character
classes, now that Grammatica 1.5 apparently supports Unicode regular
expressions. First I tried something like:

IDENTIFIER                   = <<[[:alpha:]][[:alnum:]]*>>

However, it appears that Grammatica ignores this POSIX-style syntax entirely,
instead treating each bracket as a character set made up of ':' and the
letters a, l, p, h, and so on. So I found out about the \p{Class} formulation
for property classes...

IDENTIFIER                   = <<[\p{L&}][\p{L&}]*>>
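
Going back to the [[:alpha:]] attempt for a moment: java.util.regex reads
that expression the same way. The sketch below is only an illustration in
plain Java. Whether Grammatica hands this particular pattern to
java.util.regex is an assumption, and the class name BracketClassDemo is
made up for the example.

```java
import java.util.regex.Pattern;

final class BracketClassDemo {
    // Java has no [:alpha:] syntax, so this parses as a nested character
    // class whose members are ':', 'a', 'l', 'p' and 'h'.
    static final Pattern FIRST = Pattern.compile("[[:alpha:]]");

    static boolean matchesFirst(String s) {
        return FIRST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesFirst("a")); // 'a' is a member of the set
        System.out.println(matchesFirst(":")); // so is ':'
        System.out.println(matchesFirst("z")); // 'z' is not
    }
}
```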

The \p formulations like \p{L&} weren't working either, until I consulted
Java's own set of property classes, which lists Alpha and Alnum, so it turned
into:
IDENTIFIER                   = <<[\p{Alpha}][\p{Alnum}]*>>

This made it past the grammar build, targeting .NET for the tokenizer code.
But when the compiler runs, from all appearances it is using .NET's regular
expression library, which does not accept {Alpha} and {Alnum} and instead
expects Unicode general category names such as {Lu} or {Ll}.
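
For comparison, here is how the same two class names behave in plain
java.util.regex. This is a sketch only, and PosixClassDemo is a made-up
name. By default Java treats \p{Alpha} and \p{Alnum} as ASCII-only; they
only cover other scripts when the UNICODE_CHARACTER_CLASS flag is set, which
requires Java 7 or later. Whether Grammatica's Java side sets that flag is
not something this example can tell you.

```java
import java.util.regex.Pattern;

final class PosixClassDemo {
    static final String IDENT = "\\p{Alpha}\\p{Alnum}*";

    // Default mode: POSIX-style classes cover US-ASCII only.
    static final Pattern ASCII_ONLY = Pattern.compile(IDENT);

    // With the flag, the same names follow the Unicode definitions.
    static final Pattern UNICODE =
        Pattern.compile(IDENT, Pattern.UNICODE_CHARACTER_CLASS);

    static boolean asciiOnly(String s) {
        return ASCII_ONLY.matcher(s).matches();
    }

    static boolean unicodeAware(String s) {
        return UNICODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9"; // "ete" with acute accents
        System.out.println(asciiOnly("abc1"));  // ASCII identifier
        System.out.println(asciiOnly(word));    // rejected in default mode
        System.out.println(unicodeAware(word)); // accepted with the flag
    }
}
```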

As I wrote this email I came up with a solution (though it's not as pretty
as the formulations above):

IDENTIFIER                   = <<[\p{Ll}\p{Lu}\p{Lt}][\p{Ll}\p{Lu}\p{Lt}\p{Nd}]*>>

This makes it through grammar compilation and parsing, matching the correct
input. It matches any cased letter (upper, lower, or title case) as the
first character, and any such letter or decimal digit for the rest.
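
The general-category pattern can be tried outside Grammatica as well. The
following is a minimal sketch using java.util.regex, which accepts the same
\p{Ll}, \p{Lu}, \p{Lt} and \p{Nd} names. The class name and the sample
inputs are made up for illustration.

```java
import java.util.regex.Pattern;

final class IdentifierCategoryDemo {
    // Same expression as the IDENTIFIER token, written as a Java string.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\p{Ll}\\p{Lu}\\p{Lt}][\\p{Ll}\\p{Lu}\\p{Lt}\\p{Nd}]*");

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "h\u00e9llo",         // Latin letters with an accent
            "\u03a9mega1",        // Greek capital omega, then ASCII
            "1abc",               // starts with a digit
            "foo_bar",            // underscore is not in either set
            "\u65e5\u672c\u8a9e"  // CJK letters, category Lo (no case)
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```

Note that letters without case (category Lo, which includes CJK and Arabic
letters) and the underscore are not matched. If those are wanted, \p{L} or
an explicit \p{Lo} and _ could be added to the sets.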

The inability to use the Java-style classes looks like a bug/oversight in
the C# port of Grammatica, unless I'm missing a more elegant way to pull this
off? More than likely a little fudge could be thrown in to translate the
{Alpha} class into Ll + Lu + Lt and {Alnum} into the same plus Nd.
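
The kind of fudge meant here could look roughly like the sketch below. It is
purely hypothetical and not taken from Grammatica's source: the class and
method names are invented, and it assumes the class names only ever appear
inside a bracketed character set, as in the pattern above.

```java
final class ClassNameTranslator {
    // Rewrites Java-style class names into general-category names.
    // String.replace works on literal text, so no regex escaping is needed.
    // Assumption: \p{Alpha} and \p{Alnum} occur only inside [...] sets.
    static String toCategoryNames(String regex) {
        return regex
            .replace("\\p{Alnum}", "\\p{Ll}\\p{Lu}\\p{Lt}\\p{Nd}")
            .replace("\\p{Alpha}", "\\p{Ll}\\p{Lu}\\p{Lt}");
    }

    public static void main(String[] args) {
        System.out.println(toCategoryNames("[\\p{Alpha}][\\p{Alnum}]*"));
    }
}
```

A plain text substitution like this would break for a class name used
outside brackets (for example \p{Alpha}* on its own), so a real fix would
more likely belong in the regex parser than in a string rewrite.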

But otherwise congrats on the new 1.5 release! I've been using 1.4 in my
compiler project for quite a while, so it was nice to have a bit of freshness
and new features :-D

-- 
rezonant

long name: William Lahti
handle :: rezonant
freenode :: xfury
blog :: http://xfurious.blogspot.com/
site :: http://komodocorp.com/~wilahti
_______________________________________________
Grammatica-users mailing list
http://lists.nongnu.org/mailman/listinfo/grammatica-users
