Tom Christiansen wrote:
> Has anybody specifically looked at how Perl6 regexes might map to
> the various requirements of UTS#18, Unicode Regular Expressions?
>
>     http://unicode.org/reports/tr18/

Roughly, Perl6 supports Level 2 (based on a quick check of UTS#18 against the specs).

> I ask because to my inexperienced eye, quite a few perl6isms are
> *much* better at this than in perl5 obtain, and so I wondered
> whether this was by conscious intent and design.  Is/Was it?

Seems intended.

> I'm also curious whether there are active plans to address the
> tr18 requirements in perl6 regexes.  It would be a wonderful
> feather in perl6's cap to be able to legitimately claim Level 2
> or even Level 3 compliance, since besides perl5, only ICU right
> now manages even Level 1, with everybody else *very* far behind.

I would like to have:
- all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
- most of the features of ICU (e.g. transforms, localisation)
- normalization form, locale support and tailored $features at the string
  level (_not_ lexical context)

This means that any string can be in, or can be transformed into, one of these forms:
- Byte
- NFD, NFC, NFKD, NFKC
- NFG (Default Grapheme Clusters)
- NFGT (Tailored Grapheme Cluster)
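
To make the list above concrete, here is a minimal sketch in Python (not Perl6) using the standard unicodedata module. It only covers the byte view and the four standard normalization forms; NFG and NFGT have no stdlib equivalent, so they are left out. The sample string and the choice of UTF-8 for the byte form are my assumptions.

```python
import unicodedata

# Sample string (an assumption for illustration): precomposed "é" (U+00E9)
# followed by the ligature "ﬁ" (U+FB01).
s = "\u00e9\ufb01"

# The four standard normalization forms of the same string.
forms = {
    name: unicodedata.normalize(name, s)
    for name in ("NFD", "NFC", "NFKD", "NFKC")
}

# The "Byte" view, assuming UTF-8 as the encoding.
as_bytes = s.encode("utf-8")

for name, value in forms.items():
    print(name, [f"U+{ord(c):04X}" for c in value])
print("Byte", as_bytes)
```

The canonical forms (NFD, NFC) keep the ligature, while the compatibility forms (NFKD, NFKC) replace it with plain "f" and "i".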

Tailored means that the grapheme form (NFG) needs a language locale (e.g. German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation). Without a language locale, an NFG string is handled as a plain (default) NFG string.

Tailored also means that the user (the Perl6 programmer) can tailor the relevant mechanisms (formatting, normalization, collation, properties, case folding etc.).

> TR18 specifies three levels of support (Basic, Extended, and Tailored),
> with each having specific, reasonably well-defined requirements:

A lot of work has gone into the Unicode standard; using it costs nothing and saves time. E.g. the allowed characters for identifiers can be defined with the Unicode properties 'ID_Start' and 'ID_Continue', and graphemes with Grapheme_Base, Grapheme_Extend etc.
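
As an illustration of reusing the Unicode identifier properties, here is a small Python sketch. Python's own identifier rule is built on the closely related XID_Start and XID_Continue properties (plus "_" as a start character), so str.isidentifier() is only an approximation of a pure ID_Start/ID_Continue check. The candidate strings are my own examples.

```python
# Assumption: str.isidentifier() applies Python's identifier rule, which is
# based on XID_Start / XID_Continue rather than plain ID_Start / ID_Continue.
candidates = [
    "straße",     # letters only
    "naïve",      # precomposed letter with diaeresis
    "2abc",       # a digit may continue an identifier but not start one
    "a\u0301",    # letter followed by a combining acute accent
    "\u0301a",    # combining mark first: not a valid start character
]

results = {c: c.isidentifier() for c in candidates}

for text, ok in results.items():
    print(ascii(text), ok)
```

The point is that the language does not need its own hand-written character tables; the property data from the UCD decides.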

>   =Level 1: Basic Unicode Support
[...]
>    RL1.3    Subtraction and Intersection

IMHO not complete
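
For readers unfamiliar with RL1.3, a rough sketch of what set subtraction and intersection mean, written in Python. Python's re module has no set operators in character classes, so the sketch emulates them with lookaheads; this is a workaround, not how Perl6 or UTS#18 spell it. The character ranges and sample words are assumptions for illustration.

```python
import re

# Subtraction: [a-z] minus [aeiou], emulated with a negative lookahead.
consonant = re.compile(r"(?![aeiou])[a-z]")

# Intersection: [a-m] and [h-z], emulated with a positive lookahead.
h_to_m = re.compile(r"(?=[a-m])[h-z]")

consonants = consonant.findall("unicode")
overlap = h_to_m.findall("hello kitty")

print(consonants)
print(overlap)
```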

>    RL1.5    Simple Loose Matches

Hmm ...

>    RL1.6    Line Boundaries

can be defined

>    RL1.7    Supplementary Code Points

IMHO not specced
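
What RL1.7 asks for, sketched in Python: a character outside the BMP must be treated as one code point, not as two UTF-16 code units. Python 3 strings are sequences of code points, so the stdlib re module behaves this way already. The sample characters are my choice.

```python
import re

ch = "\U0001F600"  # a supplementary code point (outside the BMP)

code_points = len(ch)                          # counted as one code point
utf16_units = len(ch.encode("utf-16-le")) // 2 # but two UTF-16 code units
dot_matches_whole = re.fullmatch(r".", ch) is not None

# A character class ranging over all supplementary code points.
supplementary = re.findall("[\U00010000-\U0010FFFF]", "a\U0001F600b\U00010348")

print(code_points, utf16_units, dot_matches_whole, len(supplementary))
```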

>   =Level 2: Extended Unicode Support
>    RL2.1    Canonical Equivalents

IMHO not specced
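
A minimal sketch of the idea behind RL2.1, in Python: normalize both the pattern and the text to the same form before matching. The helper name search_canonical is hypothetical, it only handles literal patterns, and the returned offsets refer to the normalized text, not the original. It can also match part of a grapheme (a bare "e" would match inside a decomposed "é"), so it is not full canonical-equivalence matching.

```python
import re
import unicodedata


def search_canonical(literal, text):
    """Search for a literal, treating canonically equivalent sequences as equal.

    Hypothetical helper: both sides are normalized to NFD first, so match
    offsets refer to the NFD form of the text.
    """
    nfd_literal = unicodedata.normalize("NFD", literal)
    nfd_text = unicodedata.normalize("NFD", text)
    return re.search(re.escape(nfd_literal), nfd_text)


pattern = "caf\u00e9"           # precomposed é
text = "un cafe\u0301 noir"     # e followed by a combining acute accent

plain = re.search(pattern, text)            # code points differ, no match
canonical = search_canonical(pattern, text) # equal after normalization

print(plain is None, canonical is not None)
```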

>    RL2.4    Default Loose Matches
>    RL2.5    Name Properties
>    RL2.6    Wildcard Properties

IMHO not specced

>   =Level 3: Tailored Unicode Support

IMHO not specced

It would be easier to reference the appropriate chapters of the Unicode standard in the Perl6 specification. This would make the Unicode test cases reusable. An implementation should also always declare which Unicode features it implements (and which it does not), and for which version of Unicode.

Helmut Wollmersdorfer
