John M. Dlugosz wrote:
I was going over S02, and found it opens with, "By default Perl presents Unicode in "NFG" formation, where each grapheme counts as one character."

I looked up NFG, and found it to be an invention of this group, but didn't find any details when I tried to chase down the links.

As Durran already wrote, the only definition is in http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html which references 'Unicode Normalization Forms' http://www.unicode.org/reports/tr15/.

Also there is a reference to
"The Unicode Standard defines a grapheme cluster (commonly simplified to just grapheme)". IMHO the authors meant this document:

 Unicode Standard Annex #29
 Unicode Text Segmentation
 http://unicode.org/reports/tr29/

This opens a whole bunch of questions for me.

I have many unanswered questions [1] about graphemes.

If you mean that the default for what the individual items in a string are is graphemes, OK, but what does that have to do with parsing source code?

First - nothing. S01: "Perl 6 is written in Unicode." Developers can choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl source code. Characters outside the ASCII range can be used for identifiers, literals, and syntactic "punctuation" (e.g. 'bracketing pairs').

Handling that correctly is the parser's job.

Even so, that's not something that would be called a Normalization Form.

Not in Unicode, but it can be called "Grapheme Composition".

Thus

\c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
\c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).
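
The canonical-equivalence part of this assumption can be checked outside Perl 6. Below is a minimal sketch in Python (chosen only because its standard `unicodedata` module is widely available; it is not what S02 or PDD28 prescribe). It builds the four sequences by character name and compares their NFD and NFC forms. The helper name `seq` is made up for this illustration. Note that this shows only that the four spellings are canonically equivalent; it says nothing about how NFG itself would store them.

```python
import unicodedata

def seq(*names):
    """Build a string from Unicode character names (hypothetical helper)."""
    return "".join(unicodedata.lookup(n) for n in names)

variants = [
    seq("LATIN SMALL LETTER A", "COMBINING DOT ABOVE", "COMBINING DOT BELOW"),
    seq("LATIN SMALL LETTER A", "COMBINING DOT BELOW", "COMBINING DOT ABOVE"),
    seq("LATIN SMALL LETTER A WITH DOT ABOVE", "COMBINING DOT BELOW"),
    seq("LATIN SMALL LETTER A WITH DOT BELOW", "COMBINING DOT ABOVE"),
]

# Normalize every spelling to NFD and NFC and collect the distinct results.
nfd_forms = {unicodedata.normalize("NFD", v) for v in variants}
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}

# Canonically equivalent input yields exactly one distinct form each.
print(len(nfd_forms), len(nfc_forms))
for form in nfc_forms:
    print([unicodedata.name(ch) for ch in form])
```

If the NFC form printed at the end still consists of more than one codepoint, that is exactly the case where counting codepoints and counting graphemes give different lengths.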

Character set encodings and stuff is one of my strengths. I'd like to straighten this out, and can certainly straighten out the wording, but first need to know what you meant by that.

What's specified:
1) A grapheme is 1 character, thus has 'length' 1.
2) A grapheme has a unique internal representation as an integer for some lifetime (e.g. the process), outside the range of Unicode codepoints.
3) Graphemes can be normalized to NFD, NFC etc.
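
To make the three points above concrete, here is a rough Python sketch of one way such a scheme could work. Everything in it is an assumption made for illustration: the names `GraphemeTable`, `intern`, `expand` and `to_graphemes` are invented, the use of negative integers for multi-codepoint graphemes is just one possible way to stay "outside the Unicode codepoints", and the segmentation (a base character plus any following characters with a non-zero combining class) is a crude stand-in for the grapheme cluster rules of UAX #29.

```python
import unicodedata

class GraphemeTable:
    """Hypothetical per-process table mapping grapheme clusters to integers."""

    def __init__(self):
        self._ids = {}        # NFC cluster string -> negative id
        self._clusters = []   # position i holds the cluster with id -(i + 1)

    def intern(self, cluster):
        """Return the integer for one grapheme cluster."""
        cluster = unicodedata.normalize("NFC", cluster)
        if len(cluster) == 1:
            return ord(cluster)          # a single codepoint represents itself
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def expand(self, gid):
        """Map an integer back to its codepoint string (stored in NFC)."""
        return chr(gid) if gid >= 0 else self._clusters[-gid - 1]

def to_graphemes(text, table):
    """Split text into simplified clusters and return one integer per cluster."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch           # attach combining mark to its base
        else:
            clusters.append(ch)
    return [table.intern(c) for c in clusters]

table = GraphemeTable()
g = to_graphemes("xa\u0307\u0323y", table)
print(len(g))                                               # point 1: length in graphemes
print(g)                                                    # point 2: integer per grapheme
print(ascii(unicodedata.normalize("NFD", table.expand(g[1]))))  # point 3: back to NFD
```

Because the ids are handed out in order of first appearance, they are only meaningful within one table, which is why the lifetime and round-trip question below matters.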

[1] Open questions:

1) Will graphemes have a unique charname?
   e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes?
   If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.
4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?

Helmut Wollmersdorfer

