Re: Repeated Loopy Variable Width String Character Access is Slooooow-ish

Larry Wall Thu, 10 Jan 2008 13:52:48 -0800

Also consider this recent addition to S02:

Author: larry
Date: Thu Jan 10 13:05:42 2008
New Revision: 14486


Modified:
   doc/trunk/design/syn/S02.pod

Log:
Added some random thoughts about performance implications of grapheme view


Modified: doc/trunk/design/syn/S02.pod
==============================================================================
--- doc/trunk/design/syn/S02.pod        (original)
+++ doc/trunk/design/syn/S02.pod        Thu Jan 10 13:05:42 2008
@@ -12,9 +12,9 @@
 
   Maintainer: Larry Wall <[EMAIL PROTECTED]>
   Date: 10 Aug 2004
-  Last Modified: 5 Jan 2008
+  Last Modified: 10 Jan 2008
   Number: 2
-  Version: 124
+  Version: 125
 
 This document summarizes Apocalypse 2, which covers small-scale
 lexical items and typological issues.  (These Synopses also contain
@@ -706,6 +706,41 @@
 erroneous to pass such a non-dimensional number to a routine that
 would interpret it with the wrong units.
 
+Implementation note: since Perl 6 mandates that the default Unicode
+processing level must view graphemes as the fundamental unit rather
+than codepoints, this has some implications regarding efficient
+implementation.  It is suggested that all graphemes be translated on
+input to unique grapheme numbers and represented as integers within
+some kind of uniform array for fast substr access.  For those graphemes
+that have a precomposed form, use of that codepoint is suggested.
+(Note that this means Latin-1 can still be represented internally
+with 8-bit integers.)
+
+For graphemes that have no precomposed form, a temporary private
+id should be assigned that uniquely identifies the grapheme.
+If such identifiers are assigned consistently throughout the process,
+comparison of two graphemes is no more difficult than the comparison
+of two integers, and comparison of base characters no more difficult
+than a direct lookup into the id-to-NFD table.
+
+Obviously, any temporary grapheme ids must be translated back to
+some universal form (such as NFD) on output, and normal precomposed
+graphemes may turn into either NFC or NFD forms depending on the
+desired output.  Maintaining a particular grapheme/id mapping over the
+life of the process may have some GC implications for long-running
+processes, but most processes will likely see a limited number of
+non-precomposed graphemes.
+
+If the program has a scope that wants a codepoint view rather than
+a grapheme view, the string visible to that lexical scope must also
+be translated to universal form, just as with output translation.
+Alternately, the temporary grapheme ids may be hidden behind an
+abstraction layer.  In any case, codepoint scope should never see
+any temporary grapheme ids.  (The lexical codepoint declaration
+should probably specify which normalization form it prefers to
+view strings under.  Such a declaration could be applied to input
+translation as well.)
+
 =item *
 
 A C<Buf> is a stringish view of an array of
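
To make the scheme in the implementation note concrete, here is a rough Python sketch of how such a grapheme/id table might look. Nothing below comes from S02 or from any actual Perl 6 implementation: the class name GraphemeTable, its methods, and the choice to start private ids at 0x110000 (just past the last Unicode codepoint) are all made up for illustration. It also cheats on segmentation by treating a grapheme as a base character plus any trailing combining marks, which is much cruder than the real Unicode grapheme cluster rules (it would split decomposed Hangul, for instance).

```python
import unicodedata


class GraphemeTable:
    """Hypothetical per-process table mapping graphemes to integer ids."""

    # Assumption: private ids start just past the last Unicode codepoint,
    # so they can never collide with a real codepoint.
    FIRST_PRIVATE_ID = 0x110000

    def __init__(self):
        self._id_by_nfd = {}   # NFD string -> private id
        self._nfd_by_id = {}   # private id -> NFD string

    def _clusters(self, text):
        # Simplified segmentation: base character plus trailing combining
        # marks. Real grapheme cluster rules are more involved.
        cluster = ""
        for ch in unicodedata.normalize("NFD", text):
            if cluster and unicodedata.combining(ch) == 0:
                yield cluster
                cluster = ""
            cluster += ch
        if cluster:
            yield cluster

    def encode(self, text):
        """Translate input text into a uniform list of integers."""
        ids = []
        for nfd in self._clusters(text):
            nfc = unicodedata.normalize("NFC", nfd)
            if len(nfc) == 1:
                # A precomposed form exists: use that codepoint directly.
                ids.append(ord(nfc))
                continue
            gid = self._id_by_nfd.get(nfd)
            if gid is None:
                # No precomposed form: assign a temporary private id.
                gid = self.FIRST_PRIVATE_ID + len(self._id_by_nfd)
                self._id_by_nfd[nfd] = gid
                self._nfd_by_id[gid] = nfd
            ids.append(gid)
        return ids

    def _nfd(self, gid):
        if gid >= self.FIRST_PRIVATE_ID:
            return self._nfd_by_id[gid]
        return unicodedata.normalize("NFD", chr(gid))

    def base(self, gid):
        """Base character of a grapheme, via the id-to-NFD lookup."""
        return self._nfd(gid)[0]

    def decode(self, ids, form="NFD"):
        """Translate ids back to a universal form for output."""
        text = "".join(self._nfd(gid) for gid in ids)
        return unicodedata.normalize(form, text)
```

With a table like this, substr is plain list slicing, comparing two graphemes is comparing two integers, and pure Latin-1 input only ever produces ids below 256. Private ids never leave the table: decode() always hands back ordinary NFC or NFD text, which is also what a codepoint-view scope would have to be given.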
