=head1 Overview This synopsis summarizes the non-existent Apocalypse 9, which discussed in detail the design of Perl 6 data structures. It was primarily a discussion of how the existing features of Perl 6 combine to make it easier for the PDL folks to write numeric Perl.
=head1 Lazy lists All list contexts are lazy by default. They still flatten eventually, but only when forced to. You have to use unary C<**> to get a non-lazy flattening list context (that is, to flatten immediately like Perl 5). =head1 Sized types Sized low-level types are named most generally by appending the number of bits to a generic low-level type name: int1 int2 int4 int8 int16 int32 (aka int on 32-bit machines) int64 (aka int on 64-bit machines) uint1 (aka bit) uint2 uint4 uint8 (aka byte) uint16 uint32 uint64 num32 num64 (aka num on most architectures) num128 complex32 complex64 (aka complex on most architectures) complex128 Complex sizes indicate the size of each C<num> component rather than the total. This would extend to tensor typenames as well if they're built-in types. Of course, the typical tensor structure is just reflected in the dimensions of the array--but the principle still holds that the name is based on the number of bits of the simple base type. The unsized types C<int> and C<num> are based on the architecture's normal size for C<int> and C<double> in whatever version of C the run-time system (presumably Parrot) is compiled in. So C<int> typically means C<int32> or C<int64>, while C<num> usually means C<num64>, and C<complex> means two of whatever C<num> turns out to be. You are, of course, free to use macros or type declarations to associate additional names, such as "short" or "single". These are not provided by default. An implementation of Perl is not required to support 64-bit integer types or 128-bit floating-point types unless the underlying architecture supports them. And yes, an C<int1> can store only -1 or 0. I'm sure someone'll think of a use for it... XXX Alternately we could go with a byte count rather than a bit count. But people seem to know whether they have a 32-bit or 64-bit processor. If you ask whether they have a 4-byte or 8-byte processor, they have to think about it a long time... XXX Plus this opens the door for types like uint5. Arguably, these should be declared as int[:range(0..31)] or some such, a là Ada. =head1 Compact structs A class whose attributes are all low-level types can behave as a struct. (Access from outside the class is still only through accessors, though.) Whether such a class is actually stored compactly is up to the implementation, but it ought to behave that way, at least to the extent that it's trivially easy (from the user's perspective) to read and write to the equivalent C structure. That is, when byte-stringified, it should look like the C struct, even if that's not how it's actually represented inside the class. (This is to be construed as a substitute for at least some of the current uses of C<pack>/C<unpack>.) =head1 Compact arrays In declarations of the form: my bit @bits; my int @ints; my num @nums; my int4 @nybbles; my str @buffers; my ref[Array] @ragged2d; my complex128 @longdoublecomplex; the presence of a low-level type tells Perl that it is free to implement the array with "compact storage", that is, with a chunk of memory containing contiguous (or as contiguous as practical) elements of the specified type without any fancy object boxing that typically applies to undifferentiated scalars. (Perl tries really hard to make these elements look like objects when you treat them like objects--this is called autoboxing.) The declarations above declare one-dimensional arrays of indeterminate length. Such arrays are autoextending just like ordinary Perl arrays (at the price of occasionally copying the block of data to another memory location). For many purposes, though, it's useful to define array types of a particular size and shape that, instead of autoextending, throw an exception if you try to access outside their declared dimensionality. Such arrays tend to be faster to allocate and access as well. A multidimensional array is indexed by a semicolon list, which is really a list of lists in disguise. Each sublist is a slice of one particular dimension. So @array[0..10; 42; @x] is really short for @array.postcircumfix:[]( <== [0..10], [42], [EMAIL PROTECTED] ); though in the list of lists form, a bare number is interpreted as if it were a list of one element, so you can also say: @array.postcircumfix:[]( <== [0..10], 42, [EMAIL PROTECTED] ); Note that at the comma level, a list such as: @[EMAIL PROTECTED],@y] is always interpreted as a one-dimensional slice in the outermost dimension, which is the same as: @[EMAIL PROTECTED],@y;] or more verbosely: @array.postcircumfix:[]( <== [EMAIL PROTECTED],@y] ); To interpolate an array at the semicolon level rather than the comma level, use the C<semi> list operator: @array[semi @x; @y] which is equivalent to @array.postcircumfix:[]( <== @x, [EMAIL PROTECTED] ); Note the difference between that and @array[semi @x, @y] which is the same as @array.postcircumfix:[]( <== @x, @y ); To declare a multidimensional array, you add a shape parameter: my num @nums is shape(3); # one dimension, @nums[0..2] my int @ints is shape(4;2); # two dimensions, @ints[0..3; 0..1] The argument to a shape specification is a semicolon list, just like the inside of a multidimensional subscript. Ranges are also allowed, so you can pretend you're programming in Fortran, or awk: my int @ints is shape(1..4;1..2); # two dimensions, @ints[1..4; 1..2] You can pass a list for the shape as well: my int @ints is shape(semi @foo.shape); Again, the C<semi> list operator interpolates a list into a semicolon list, which we do for consistency with subscript notation, not because it makes a great deal of sense to allow slices for dimensional specs (apart from ranges). So while the following is okay: my int @ints is shape(0,1,2,3,4); # same as 0..4 the following is a semantic error that the compiler should catch: my int @ints is shape(3,3,3); # oops, comma instead of semicolon The shape may be supplied entirely by the object at run-time: my num @nums = Array of num.new(:shape(3;3;3)); my num @nums .=new():shape(3;3;3); # same thing Any dimension of the array may be specified as C<*>, in which case that dimension will autoextend. Typically this would be used in the final dimension to make a ragged array functionally equivalent to an array of arrays: my int @ints is shape(42; *); push(@ints[41], getsomeints()); The shape may also be specified by types rather than sizes: my int @ints is shape(Even; Odd); or by both: my int @ints is shape(0..100 where Even; 1..99 where Odd); (presuming C<Even> and C<Odd> are types already constrained to be even or odd). =head1 PDL support An array C<@array> can be tied to a piddle at declaration time: my num @array is Piddle is shape(semi @mytensorshape); my @array is Piddle(:shape(2;2;2;2)) of int8; Piddles are allowed to assume a type of C<num> by default rather than the usual simple scalar. (And in general, the type info is merely made available to the "tie" implementation to do with what it will. Some data structures may ignore the "of" type and just store everything as general scalars. Too bad...) Arrays by default are one dimensional, but may be declared to have any dimensionality supported by the implementation. You may use arrays just like scalar references--the main caveat is that you have to use binding rather than assignment to set one without copying: @b := @a[0...:by(2)] With piddles in particular, this might alias each of the individual elements rather than the array as a whole. So modifications to @b are likely to be reflected back into @a. (But maybe the PDLers will prefer a different notation for that.) The dimensionality of an array may be declared on the variable, but the actual dimensionality of the array depends on how it was created. Reconciling these views is a job for the particular array implementation. It's not necessarily the case that the declared dimensionality must match the actual dimensionality. It's quite possible that the array variable is deliberately declared with a different dimensionality to provide a different "view" on the actual value: my int @array is Puddle is shape(2;2) .= new(:shape(4) <== 0,1,2,3); Again, reconciling those ideas is up to the implementation, C<Puddle> in this case. The traits system is flexible enough to pass any metadata required, including ideas about sparseness, raggedness, and various forms of non-rectangleness such as triangleness. The implementation should probably carp about any metadata it doesn't recognize though. The implementation is certainly free to reject any object that doesn't conform to the variable's shape requirements. =head1 Subscript and slice notation A subscript indicates a "slice" of an array. Each dimension of an array is sliced separately, so we say a subscript is a semicolon-separated list of slice specifiers. A three-dimensional slice might look like this: @x[0..10; 1,0; 1...:by(2)] It is up to the implementation of C<@x> to decide how aggressively or lazily this subscript is evaluated, and whether the slice entails copying. (The PDL folks will generally want it to merely produce a virtual piddle where the new array aliases its values back into the old one.) Of course, a single element can be selected merely by providing a single index value to each slice list: @x[0;1;42] =head1 The semicolon operator At the statement level, a semicolon terminates the current expression. Within any kind of bracketing construct, semicolon notionally produces a list of lists, the interpretation of which depends on the context. Such a semicolon list always provides list context to each of its sublists. The following two constructs are structurally indistinguishable: (0..10; 1,2,4; 3) ([0..10], [1,2,3,4], [3]) Of course, in known contexts such as array subscripts, the compiler is free to optimize away the actual construction of sublists where that's unnecessary. Single dimensional arrays expect simple slice subscripts, meaning they will treat a list subscript as a slice in the single dimension of the array. Multi-dimensional arrays, on the other hand, always expect a list of slice lists, one for each dimension. You need not specify all the dimensions; if you don't, the unspecified dimensions are "wildcarded". Supposing you have: my num @nums is shape(3;3;3); Then @nums[0..2] is the same as @nums[0..2;] which is the same as @nums[0,1,2;*;*] But you should maybe write the last form anyway just for good documentation, unless you don't actually know how many more dimensions there are. If you wanted that C<0..2> range to mean @nums[0;1;2] instead, then you need to use that C<semi> we keep mentioning: @nums[semi 0..2] The zero-dimensional slice: @x[] is assumed to want everything, not nothing. It's particularly handy because Perl 6 (unlike Perl 5) won't interpolate a bare array without brackets: @x = (1,2,3); say "@x = @x[]"; # prints @x = 1 2 3 Lists are lazy in Perl 6, and the slice lists are no exception. In particular, things like range objects are not flattened until they need to be, if ever. So a PDL implementation is free to steal the values from these ranges and "piddle" around with them: @nums[$min..$max:by(3)] @nums[$min..$max] @nums[$min...:by(3)] @nums[1...:by(2)] # the odds @nums[0...:by(2)] # the evens That's all just the standard Perl 6 notation for ranges. Additional syntactic relief is always available as long as it's predeclared somehow. It's possible the range operator could be taught that C<:2> means C<:by(2)>, for instance. (But I rather dislike the RFC-proposed C<0:10:2> notation that makes colon mean two different things so close together, plus it conflicts with Perl 6's general adverb notation if the next thing is alphabetic.) XXX Another ugly possibility is to overload something looser than C<..> (like C<+=>) to modify a range object: 0..10+=2 (always assuming that shouldn't just turn the range into C<2..12>.) Another thing that's not going to fly easily is simply dropping out terms. Perl depends rather heavily on knowing when it's expecting a term or an operator, and simply leaving out terms before or after a binary operator really screws that up. For instance, 0..:by(2) parses as 0 .. (by => 2) rather than 0 .. Inf :by(2) That why we have postfix C<...> to mean C<..Inf>. But then if you leave out the first argument: ...:by(2) you've written the yada-yada-yada operator, which is actually a term that will not produce an infinite range for you. Don't do that. Maybe you should just find some nice Unicode characters for your operators... =head1 PDL signatures To rewrite a Perl 5 PDL definition like this: pp_def( 'inner', Pars => 'a(n); b(n); [o]c(); ', # the signature, see above Code => 'double tmp = 0; loop(n) %{ tmp += $a() * $b(); %} $c() = tmp;' ); you might want to write a macro that parses something vaguely resembling this: role PDL_stuff[type $TYPE] { PDLsub inner (@a[$n], @b[$n]) returns(@c[]) { my ::{$TYPE} $tmp = 0; for 0..^$n { $tmp += @a[$_] * @b[$_]; } @c[] = tmp; } } where that turns into something like this: role PDL_stuff[type $TYPE] { multi sub inner (::{$TYPE} @a, ::{$TYPE} @b) returns(::{$TYPE}) { my $n = [EMAIL PROTECTED]; # or maybe $n is just a parameter assert($n == [EMAIL PROTECTED]); # and this is already checked by PDL my ::{$TYPE} $tmp = 0; for 0..^$n { $tmp += @a[$_] * @b[$_]; } return $tmp; } } Then any class that C<does PDL_stuff[num]> has an C<inner()> function that can (hopefully) be compiled down to a form useful to the PDL threading engine. Presumably the macro also stores away the PDL signature somewhere safe, since the translated code hides that information down in procedural code. Possibly some of the C<[n]> information can come back into the signature via C<where> constraints on the types. This would presumably make multimethod dispatch possible on similarly typed arrays with differing constraints. XXX It's not clear whether C<[EMAIL PROTECTED]> should return the size of the entire array or the size of the first dimension (or the scalar value of the entire array if it's really a zero-dimensional array!). I've put C<[EMAIL PROTECTED]> above just in case. (The special destruction problems of Perl 5's PDL should go away with Perl 6's GC approach, as long as PDL's objects are registered with Parrot correctly.) =head1 Junctions A junction is a superposition of data values pretending to be a single data value. Junctions come in four varieties: list op infix op ======= ======== any() | all() & one() ^ none() (no "nor" op defined) Note that the infix ops are "list-associative", insofar as $a | $b | $c $a & $b & $c $a ^ $b ^ $c mean any($a,$b,$c) all($a,$b,$c) one($a,$b,$c) rather than any(any($a,$b),$c) all(all($a,$b),$c) one(one($a,$b),$c) Some contexts, such as boolean contexts, have special rules for dealing with junctions. In any scalar context not expecting a junction of values, a junction produces automatic parallelization of the algorithm. In particular, if a junction is used as an argument to any routine (operator, closure, method, etc.), and the scalar parameter you are attempting to bind the argument to is inconsistent with the Junction type, that routine is "autothreaded", meaning the routine will be called automatically as many times as necessary to process the individual scalar elements of the junction in parallel. Junctions passed as part of a container do not cause autothreading unless individually pulled out and used as a scalar. It follows that junctions passed as members of a "slurpy" array or hash do not cause autothreading on that parameter. Only individually declared parameters may autothread. (Note that positional array and hash parameters are in fact scalar parameters, though, so you could pass a junction of array or hash references.) =head1 Parallelized parameters and autothreading Within the scope of a C<use autoindex> pragma (or equivalent, such as C<use PDL> (maybe)), any closure that uses parameters as subscripts is also a candidate for autothreading. For each such parameter, the compiler supplies a default value that is a junction of all possible values that subscript can take on. That is, if you have a closure of the form: -> $x, $y { @foo[$x;$y] } then the compiler adds defaults for you, something like: -> $x = @foo.shape[0].all, $y = @foo.shape[1].all { @foo[$x;$y] } In the abstract (and often in the concrete), this puts an implicit loop around the block of the closure that visits all the possible subscript values for that dimension (unless the parameter is actually supplied to the closure, in which case that is what is used as the slice subscript). So to write a typical tensor multiplication: Cijkl = Aij * Bkl you can just write this: do { @c[$^i, $^j, $^k, $^l] = @a[$^i, $^j] * @b[$^k, $^l] }; or equivalently: -> $i, $j, $k, $l { @c[$i, $j, $k, $l] = @a[$i, $j] * @b[$k, $l] }(); or even: do -> $i, $j, $k, $l { @c[$i, $j, $k, $l] = @a[$i, $j] * @b[$k, $l] } That's almost pretty. It is erroneous for an unbound parameter to match multiple existing array subscripts differently. (Arrays being created don't count.) Note that you could pass any of $i, $j, $k or $l explicitly, or prebind them with a C<.assuming> method, in which only the unbound parameters autothread. If you use an unbound array parameter as a semicolon-list interpolator (via the C<semi> list operator), it functions as a wildcard list of subscripts that must match the same everywhere that parameter is used. For example, do -> @wild { @b[semi reverse @wild] = @a[semi @wild]; }; produces an array with the dimensions reversed regardless of the dimensionality of C<@a>. The optimizer is, of course, free to optimize away any implicit loops that it can figure out how to do more efficiently without changing the semantics. See RFC 207 for more ideas on how to use autothreading (though the syntax proposed there is rather different). =head1 Hashes Everything we've said for arrays applies to hashes as well, except that if you're going to limit the keys of one dimension of a hash, you have to provide an explicit list of keys to that dimension of the shape, or an equivalent range: my num %hash is shape(«a b c d e f»; *); my num %hash is shape('a'..'f'; *); # same thing To declare a hash that can take any object as a key rather than just a string, say something like: my %hash is shape(Any); Likewise, you can limit the keys to objects of particular types: my Fight %hash is shape(Dog;Cat) =head1 Autosorted hashes The default hash iterator is a property called C<.iterator> that can be user replaced. When the hash itself needs an iterator for C<.pairs>, C<.keys>, C<.values>, or C<.kv>, it calls C<%hash.iterator()> to start one. In scalar context, C<.iterator> returns an iterator object. In list context, it returns a lazy list fed by the iterator. It must be possible for a hash to be in more than one iterator at at time, as long as the iterator state is stored in a lazy list. However, there is only one implicit iterator that works in scalar context to return the next pair. The downside to making a hash autosort via the iterator is that you'd have to store all the keys in sorted order, and resort it when the hash changes. Alternately, the entire hash could be tied to an ISAM implementation (not included (XXX or should it be?)). For multidimensional hashes, the key returned by any hash iterator is a list of keys, the size of which is the number of declared dimensions of the hash.