Also consider this recent addition to S02:

Author: larry
Date: Thu Jan 10 13:05:42 2008
New Revision: 14486

Modified:
   doc/trunk/design/syn/S02.pod

Log:
Added some random thoughts about performance implications of grapheme view

Modified: doc/trunk/design/syn/S02.pod
==============================================================================
--- doc/trunk/design/syn/S02.pod (original)
+++ doc/trunk/design/syn/S02.pod Thu Jan 10 13:05:42 2008
@@ -12,9 +12,9 @@
 
   Maintainer: Larry Wall <[EMAIL PROTECTED]>
   Date: 10 Aug 2004
-  Last Modified: 5 Jan 2008
+  Last Modified: 10 Jan 2008
   Number: 2
-  Version: 124
+  Version: 125
 
 This document summarizes Apocalypse 2, which covers small-scale
 lexical items and typological issues. (These Synopses also contain
@@ -706,6 +706,41 @@
 erroneous to pass such a non-dimensional number to a routine that
 would interpret it with the wrong units.
 
+Implementation note: since Perl 6 mandates that the default Unicode
+processing level must view graphemes as the fundamental unit rather
+than codepoints, this has some implications regarding efficient
+implementation. It is suggested that all graphemes be translated on
+input to unique grapheme numbers and represented as integers within
+some kind of uniform array for fast substr access. For those graphemes
+that have a precomposed form, use of that codepoint is suggested.
+(Note that this means Latin-1 can still be represented internally
+with 8-bit integers.)
+
+For graphemes that have no precomposed form, a temporary private
+id should be assigned that uniquely identifies the grapheme.
+If such identifiers are assigned consistently throughout the process,
+comparison of two graphemes is no more difficult than the comparison
+of two integers, and comparison of base characters no more difficult
+than a direct lookup into the id-to-NFD table.
+
+Obviously, any temporary grapheme ids must be translated back to
+some universal form (such as NFD) on output, and normal precomposed
+graphemes may turn into either NFC or NFD forms depending on the
+desired output. Maintaining a particular grapheme/id mapping over the
+life of the process may have some GC implications for long-running
+processes, but most processes will likely see a limited number of
+non-precomposed graphemes.
+
+If the program has a scope that wants a codepoint view rather than
+a grapheme view, the string visible to that lexical scope must also
+be translated to universal form, just as with output translation.
+Alternately, the temporary grapheme ids may be hidden behind an
+abstraction layer. In any case, codepoint scope should never see
+any temporary grapheme ids. (The lexical codepoint declaration
+should probably specify which normalization form it prefers to
+view strings under. Such a declaration could be applied to input
+translation as well.)
+
 =item *
 
 A C<Buf> is a stringish view of an array of