Re: "Unicode in 'NFG' formation" ?

Helmut Wollmersdorfer Mon, 18 May 2009 02:15:01 -0700

John M. Dlugosz wrote:

I was going over S02, and found it opens with, "By default Perl presents Unicode in "NFG" formation, where each grapheme counts as one character."

I looked up NFG, and found it to be an invention of this group, but didn't find any details when I tried to chase down the links.

As Durran already wrote, the only definition is in http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html which references 'Unicode Normalization Forms' http://www.unicode.org/reports/tr15/.


Also there is a reference to

"The Unicode Standard defines a grapheme cluster (commonly simplified to just grapheme)". IMHO the authors meant this document:


 Unicode Standard Annex #29
 Unicode Text Segmentation
 http://unicode.org/reports/tr29/

This opens a whole bunch of questions for me.


I have many unanswered questions [1] about graphemes.

If you mean that the default for what the individual items in a string are is graphemes, OK, but what does that have to do with parsing source code?

First - nothing. S01: "Perl 6 is written in Unicode." Developers can choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl source code. Characters outside the ASCII range can be used for identifiers, literals, and syntactic "punctuation" (e.g. 'bracketing pairs').


It's the parser's job to handle it correctly.

Even so, that's not something that would be called a Normalization Form.


Not in Unicode, but it can be called "Grapheme Composition".

Thus

\c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
\c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).
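To make that assumption a bit more concrete, here is a small Python sketch (Python only because its unicodedata module ships the normalization tables; it says nothing about how Perl 6 or Parrot would implement NFG). It builds the four sequences above from their character names and checks that they are canonically equivalent in the sense of UAX #15, i.e. that they collapse to one and the same string under NFD and under NFC. The helper name from_names is made up for this example.

```python
import unicodedata


def from_names(*names):
    """Build a string from Unicode character names (hypothetical helper)."""
    return "".join(unicodedata.lookup(name) for name in names)


# The four spellings listed above.
sequences = [
    from_names("LATIN SMALL LETTER A", "COMBINING DOT ABOVE", "COMBINING DOT BELOW"),
    from_names("LATIN SMALL LETTER A", "COMBINING DOT BELOW", "COMBINING DOT ABOVE"),
    from_names("LATIN SMALL LETTER A WITH DOT ABOVE", "COMBINING DOT BELOW"),
    from_names("LATIN SMALL LETTER A WITH DOT BELOW", "COMBINING DOT ABOVE"),
]

# Canonically equivalent strings normalize to identical code point sequences.
nfd_forms = {unicodedata.normalize("NFD", s) for s in sequences}
nfc_forms = {unicodedata.normalize("NFC", s) for s in sequences}

for s in sequences:
    before = " ".join(f"U+{ord(c):04X}" for c in s)
    after = " ".join(f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s))
    print(f"{before:<24} NFD -> {after}")

print("distinct NFD forms:", len(nfd_forms))
print("distinct NFC forms:", len(nfc_forms))
```

If all four spellings normalize to the same string, a grapheme layer built on top of a normalization form could treat them as one character, which is what I assume above.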

Character set encodings and stuff is one of my strengths. I'd like to straighten this out, and can certainly straighten out the wording, but first need to know what you meant by that.


What's specified:
1) A grapheme is 1 character, thus has 'length' 1.

2) A grapheme has a unique internal representation as an integer for some life-time (process), outside the Unicode codepoints.

3) Graphemes can be normalized to NFD, NFC etc.
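A rough Python sketch of how I read these three points, under loud assumptions: the names GraphemeTable, split_clusters and encode are invented here, negative integers are just one possible way to stay "outside the Unicode codepoints", and the cluster splitting (a base character plus following characters with a non-zero combining class) is a simplification, not the UAX #29 algorithm. It is not taken from the Parrot PDD.

```python
import unicodedata


class GraphemeTable:
    """Per-process table of multi-codepoint graphemes (illustrative only)."""

    def __init__(self):
        self._ids = {}       # NFC cluster -> negative integer id
        self._clusters = []  # index (-id - 1) -> NFC cluster

    def intern(self, cluster):
        """Return one integer for a cluster; ids are only valid for this table."""
        nfc = unicodedata.normalize("NFC", cluster)
        if len(nfc) == 1:
            return ord(nfc)  # a single code point represents itself
        if nfc not in self._ids:
            self._clusters.append(nfc)
            self._ids[nfc] = -len(self._clusters)
        return self._ids[nfc]

    def expand(self, gid):
        """Map an integer back to its code points (in NFC)."""
        return chr(gid) if gid >= 0 else self._clusters[-gid - 1]


def split_clusters(text):
    """Simplified segmentation: attach combining marks to the preceding base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def encode(text, table):
    """Turn text into a list with one integer per grapheme."""
    return [table.intern(c) for c in split_clusters(text)]


table = GraphemeTable()
first = encode("a\u0307\u0323b", table)  # a + dot above + dot below, then b
second = encode("\u1ea1\u0307", table)   # a-with-dot-below + dot above
print(len(first), first[0] == second[0])
```

In this sketch the length of the encoded list is the character count from point 1, the negative ids are the process-local integers from point 2, and expand() followed by unicodedata.normalize() gives the NFD or NFC view from point 3. It also shows why open question 3 matters: the ids mean nothing outside the table that issued them.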

[1] Open questions:

1) Will graphemes have a unique charname?
   e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes?
   If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.

4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?


Helmut Wollmersdorfer
