Rakudo and non-ASCII character classes in rules

Aaron Sherman Tue, 25 May 2010 01:43:11 -0700

I came up with these tests, which I thought should work:

ok("π" ~~ /<[π]>/, "π as a character class");
ok("π" ~~ /<[\x03c0]>/, "π as a character class (hex)");
ok("π" ~~ /<[\x0391 .. \x03c9]>/, "π in a character class range");
ok("π" ~~ /\w/, "π as a word character");


Of those, only the first one actually works. The other three all fail.
Am I misunderstanding how these constructs should work?
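
As a sanity check on my expectations, here is a minimal sketch that compares codepoints numerically instead of going through the regex engine. It assumes a UTF-8 source file and that .ord returns the codepoint of the first character; the variable name $pi is only for illustration.

```raku
# π is U+03C0, so it should equal \x03c0 and fall inside \x0391 .. \x03c9.
my $pi = "π";

# Exact codepoint comparison, mirroring the hex test.
say $pi.ord == 0x03C0;

# Range membership, mirroring the character class range test.
say 0x0391 <= $pi.ord <= 0x03C9;
```

If both lines print a true value, the codepoints themselves are what I expect, which would point at the character class handling rather than at my test data.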

PS: The reason I'm running into this is that my URI matcher for
RFC 3987 needs to match many large blocks of characters (essentially
all non-ASCII, valid Unicode codepoints) per the spec at:

http://www.ietf.org/rfc/rfc3987.txt

Specifically:

ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

Which I tried to translate as:

        token ucschar {
            <+[\xA0 .. \xD7FF] + [\xF900 .. \xFDCF] + [\xFDF0 .. \xFFEF] +
            [\x10000 .. \x1FFFD] + [\x20000 .. \x2FFFD] +
            [\x30000 .. \x3FFFD] + [\x40000 .. \x4FFFD] +
            [\x50000 .. \x5FFFD] + [\x60000 .. \x6FFFD] +
            [\x70000 .. \x7FFFD] + [\x80000 .. \x8FFFD] +
            [\x90000 .. \x9FFFD] + [\xA0000 .. \xAFFFD] +
            [\xB0000 .. \xBFFFD] + [\xC0000 .. \xCFFFD] +
            [\xD0000 .. \xDFFFD] + [\xE1000 .. \xEFFFD]>
        }

But this refuses to match my test IRI's one-character path:

 http://www.example.com/π
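
For comparison, the same ucschar membership test can be sketched without character classes by checking the codepoint against numeric ranges. This is only a rough workaround sketch, not something from the spec or from Rakudo: the names @ucschar-ranges and is-ucschar are hypothetical, and it assumes that smartmatching an integer against a junction of ranges is supported.

```raku
# Codepoint ranges copied from the ucschar rule in RFC 3987.
my @ucschar-ranges =
    0xA0    .. 0xD7FF,  0xF900  .. 0xFDCF,  0xFDF0  .. 0xFFEF,
    0x10000 .. 0x1FFFD, 0x20000 .. 0x2FFFD, 0x30000 .. 0x3FFFD,
    0x40000 .. 0x4FFFD, 0x50000 .. 0x5FFFD, 0x60000 .. 0x6FFFD,
    0x70000 .. 0x7FFFD, 0x80000 .. 0x8FFFD, 0x90000 .. 0x9FFFD,
    0xA0000 .. 0xAFFFD, 0xB0000 .. 0xBFFFD, 0xC0000 .. 0xCFFFD,
    0xD0000 .. 0xDFFFD, 0xE1000 .. 0xEFFFD;

# Hypothetical helper: true if the first character of $char is a ucschar.
sub is-ucschar(Str $char) {
    my $cp = $char.ord;
    # Smartmatch the codepoint against each range; `so` collapses the
    # resulting junction to a single Bool.
    so $cp ~~ any(@ucschar-ranges);
}

say is-ucschar("π");   # π is U+03C0, inside the first range
say is-ucschar("a");   # plain ASCII, outside every range
```

With this check π counts as a ucschar, since U+03C0 lies in the first range, so I would expect the token above to match it as well.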

--
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs
