Hi everyone,

The following are some suggestions. I'd love to help work on them. Sorry for the length.
- Zaki Mughal

---

I announced adding data frames for PDL several months back <http://comments.gmane.org/gmane.comp.lang.perl.pdl.general/8335>, along with my intention to embed R in Perl. Embedding R in Perl is actually complete now and just about ready for CPAN <https://github.com/zmughal/embedding-r-in-perl-experiment>, thanks to the help of the gang on #inline <http://inline.ouistreet.com/node/zfp7.html>.

To build the data frames and match R's types, I created several subclasses of PDL that handle a subset of PDL functions, but I haven't figured out a way to wrap all of PDL's functionality systematically. I have several thoughts on this.

## Levels of measurement

One of the nice things R does is warn or raise an error when you try an operation that would be invalid on a certain type of data. One such type is categorical data, which R calls factors and for which I made a subclass of PDL called PDL::Factor. Some of this behaviour is inspired by the statistical methodology of levels of measurement <https://en.wikipedia.org/wiki/Level_of_measurement>. I believe SAS even explicitly allows assigning levels of measurement to variables.

For example, if I try to apply the mean() function to all the columns of the iris data set, I get this warning:

```r
lapply( iris, mean )
#> $Sepal.Length
#> [1] 5.843333
#>
#> $Sepal.Width
#> [1] 3.057333
#>
#> $Petal.Length
#> [1] 3.758
#>
#> $Petal.Width
#> [1] 1.199333
#>
#> $Species
#> [1] NA
#>
#> Warning message:
#> In mean.default(X[[5L]], ...) :
#>   argument is not numeric or logical: returning NA
```

`NA` is R's equivalent of PDL's `BAD` values, and returning it from `mean()` makes sense for categorical data. For logical vectors, `mean()` does something else:

```r
which_setosa <- iris$Species == 'setosa'  # this is a logical vector
mean( which_setosa )
#> [1] 0.3333333
```

This says that 1/3 of the logical values were true, which may be a useful thing for `mean()` to return in that case.
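To make the idea concrete outside of R, here is a minimal Python sketch (the class names and behaviour are hypothetical illustrations, not PDL::Factor's actual interface): a nominal type refuses to compute a mean, while a logical type reports the proportion of true values.

```python
import math

class Factor:
    """A nominal (categorical) vector: by levels of measurement, its mean
    is meaningless, so mean() warns and returns NaN (the NA/BAD analogue)."""
    def __init__(self, values):
        self.values = list(values)
        self.levels = sorted(set(self.values))  # the distinct categories

    def mean(self):
        print("warning: argument is not numeric or logical: returning NaN")
        return math.nan

class Logical:
    """A boolean vector: its mean is the proportion of true values."""
    def __init__(self, values):
        self.values = [bool(v) for v in values]

    def mean(self):
        return sum(self.values) / len(self.values)

species = Factor(['setosa'] * 50 + ['versicolor'] * 50 + ['virginica'] * 50)
which_setosa = Logical(v == 'setosa' for v in species.values)

print(math.isnan(species.mean()))     # True
print(round(which_setosa.mean(), 7))  # 0.3333333
```

The point is only that the operation's validity is decided by the data's level of measurement, not by its storage type.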
Thinking in terms of levels of measurement can help with another experiment of mine, which is based around tracking units of measure for numerical things in Perl. Code is here <https://github.com/zmughal/units-experiment/blob/master/overload_override.pl>. What I do there is use Moo roles to add a unit attribute to numerical types (Perl scalars, Number::Fraction, PDL, etc.). Whenever a value goes through an operation, either via operator overloading or a function call such as `sum()`, the unit is carried along with it and manipulated appropriately (you can take the mean of kelvins, but not of degrees Celsius). I know that units of measure are messy to implement, but being able to support auxiliary operations like this will go a long way toward making PDL flexible.

[Has anyone used udunits2? I made an Alien package for it; it's on CPAN.]

## DataShape and Blaze

I think it would be beneficial to look at the work being done by the Blaze project <http://blaze.pydata.org/> with its DataShape specification <http://datashape.pydata.org/>. The idea behind it is to be able to use the various array-like APIs without having to worry about what is going on in the backend, be it CPU-based, GPU-based, SciDB, or even a SQL server.

## Julia

Julia has been doing some amazing things with how they've grown their language. I was looking to see if they have anything similar to PDL's dataflow and I came across ArrayViews <https://github.com/JuliaLang/ArrayViews.jl>. It may be enlightening to see how they compose this feature on top of already existing n-d arrays, as opposed to how PDL builds it in. I do not know what tradeoffs that brings, but it is a starting point to think about. A similar approach might also support sparse arrays. In fact, one of Julia's strengths is how it uses multimethods to handle new types with ease; see "The Design Impact of Multiple Dispatch" <http://nbviewer.ipython.org/gist/StefanKarpinski/b8fe9dbb36c1427b9f22> for examples.
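As a rough illustration of what ArrayViews-style composition looks like, here is a Python sketch of a strided view layered over a plain buffer. It is hypothetical and far simpler than either Julia's or PDL's implementation, but it shows the key property shared with PDL's dataflow slices: no copy is made, and writes through the view reach the parent.

```python
class View:
    """A lightweight view into an existing 1-D buffer, described by an
    offset, a stride, and a length. Nothing is copied; reads and writes
    go straight through to the parent buffer."""
    def __init__(self, parent, offset=0, stride=1, length=None):
        self.parent = parent
        self.offset, self.stride = offset, stride
        self.length = length if length is not None else len(parent)

    def _index(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.offset + i * self.stride

    def __getitem__(self, i):
        return self.parent[self._index(i)]

    def __setitem__(self, i, value):
        self.parent[self._index(i)] = value  # dataflow: mutates the parent

    def __len__(self):
        return self.length

data = list(range(10))                        # the underlying storage
evens = View(data, offset=0, stride=2, length=5)

print([evens[i] for i in range(len(evens))])  # [0, 2, 4, 6, 8]
evens[1] = 99                                 # writes flow back to the parent
print(data[2])                                # 99
```

Composing views this way, as a type layered over an existing array, is what makes the Julia approach interesting compared with building the machinery into the array itself.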
[Perl 6 has built-in multimethods.]

## MATLAB subclassing

I use MATLAB daily, and I came across the area of its documentation that covers subclassing the built-in classes <http://www.mathworks.com/help/matlab/matlab_oop/subclassing-matlab-built-in-classes.html>. Some of the information there is good for knowing how *not* to implement things, but there is also some discussion of what is necessary for the storage types that might be worth looking at.

[By the way, I have downloaded all of the code on MATLAB Central's File Exchange, and I could do some analysis of the functions used there if that would be helpful.]

## GPU and threading

I think it would be best to offload GPU support to other libraries, so it would be good to extract what is common between libraries like

- MAGMA <http://icl.cs.utk.edu/magma/>,
- ViennaCL <http://viennacl.sourceforge.net/>,
- Blaze-lib <https://code.google.com/p/blaze-lib/>,
- VXL <http://vxl.sourceforge.net/>,
- Spark <http://spark.apache.org/>,
- Torch <http://torch.ch/>,
- Theano <http://www.deeplearning.net/software/theano/>,
- Eigen <http://eigen.tuxfamily.org/>, and
- Armadillo <http://arma.sourceforge.net/>.

Eigen is particularly interesting because it supports both row-major and column-major storage <http://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html>.

Another source of inspiration would be the VSIPL spec <http://www.omgwiki.org/hpec/vsipl>. It's a standard for signal processing routines in the embedded DSP world, and it comes with "Core" and "Core Lite" profiles which might help decide what should go into a smaller subset of PDL.

Also on my wishlist is interoperability with libraries like ITK <http://www.itk.org/>, VTK <http://www.vtk.org/>, and yt <http://yt-project.org/>. They have interesting architectures, especially for computation. Unfortunately, the first two are C++ based, and I don't have experience combining C++ and XS.
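The row-major versus column-major distinction Eigen exposes can be illustrated without any library at all; here is a small Python sketch of the two index formulas and how the same matrix flattens differently under each (the helper names are my own, for illustration):

```python
# The same logical 2x3 matrix, laid out under the two storage orders.
matrix = [[1, 2, 3],
          [4, 5, 6]]
nrows, ncols = 2, 3

def index_row_major(i, j):
    # Row-major (Eigen's RowMajor option, C convention):
    # a row's elements are adjacent in memory.
    return i * ncols + j

def index_col_major(i, j):
    # Column-major (Eigen's default, Fortran convention):
    # a column's elements are adjacent in memory.
    return j * nrows + i

flat_row = [0] * (nrows * ncols)
flat_col = [0] * (nrows * ncols)
for i in range(nrows):
    for j in range(ncols):
        flat_row[index_row_major(i, j)] = matrix[i][j]
        flat_col[index_col_major(i, j)] = matrix[i][j]

print(flat_row)  # [1, 2, 3, 4, 5, 6]
print(flat_col)  # [1, 4, 2, 5, 3, 6]
```

Supporting both orders is largely a matter of abstracting this index arithmetic, which is why Eigen can offer it as a template option; it also matters for zero-copy interoperability with Fortran-order libraries.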
## Better testing

PDL should make more guarantees about how types flow through the system. This might be accomplished by adding assertions in the style of Design by Contract, which could act as both a testable spec and documentation. I'm working on the test suite right now on a branch, and I hope to create a proof of concept of this soon. I hope this can help make PDL more consistent and more easily testable. There are still small inconsistencies that shouldn't be there and that can be weeded out with testing. For example, what type is expected for this code?

```perl
use PDL;
print stretcher( sequence(float, 3) )->type;
```

I would expect 'float', but it is actually 'double' under PDL v2.007_04.

## Incremental computation

I find that the way I grow my code is to slowly add modules that work together in a pipeline. Running and rerunning this code through all the modules is slow, so to avoid that I create multiple small programs that read and write files to pass data from one script to the next. I was looking for a better solution and came across IncPy <http://www.pgbovine.net/incpy.html>, which modifies the Python interpreter to support automatic persistent memoization. I don't think the idea has caught on, but I think it should, and perhaps Perl and PDL are flexible enough to herald it as a CPAN module.

_______________________________________________
Perldl mailing list
Perldl@jach.hawaii.edu
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
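The Design-by-Contract idea from the "Better testing" section can be sketched in a few lines of Python (a hypothetical decorator and a toy stand-in for PDL's stretcher(), not PDL's real machinery): a postcondition asserts that the output's element type matches the input's, which is exactly the contract the float-in/double-out example breaks.

```python
import functools

def preserves_type(fn):
    """Design-by-Contract style postcondition: every element of the result
    must have the same type as the input's elements. A function with the
    float-in/double-out behaviour described above would fail this loudly."""
    @functools.wraps(fn)
    def wrapper(vec):
        result = fn(vec)
        want = type(vec[0])
        for row in result:
            for x in row:
                assert type(x) is want, (
                    f"type contract violated: got {type(x).__name__}, "
                    f"expected {want.__name__}")
        return result
    return wrapper

@preserves_type
def stretcher(vec):
    # Toy analogue of PDL's stretcher(): a square matrix with vec on the
    # diagonal, with the zeros built to keep the input's element type.
    zero = type(vec[0])(0)
    n = len(vec)
    return [[vec[i] if i == j else zero for j in range(n)]
            for i in range(n)]

print(stretcher([1.0, 2.0, 3.0]))  # floats in, floats out: contract holds
```

Because the contract runs on every call, it doubles as executable documentation of the intended type flow.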
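For a sense of what IncPy-style incremental computation looks like without modifying any interpreter, here is a small Python sketch of persistent memoization as a plain decorator (my own illustration of the idea, not IncPy's actual mechanism, which works automatically inside the interpreter):

```python
import functools
import hashlib
import os
import pickle
import tempfile

def persistent_memoize(cache_dir=None):
    """Persistent memoization: results are pickled to disk, keyed by the
    function name and its arguments, so rerunning a pipeline skips any
    stage whose inputs have not changed."""
    cache_dir = cache_dir or tempfile.mkdtemp(prefix="memo-")
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):      # cache hit: reuse persisted result
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = fn(*args, **kwargs)
            with open(path, "wb") as fh:  # cache miss: compute, then persist
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator

calls = []

@persistent_memoize()
def slow_stage(x):
    calls.append(x)  # record how often the real computation runs
    return x * x

print(slow_stage(4))  # 16 (computed)
print(slow_stage(4))  # 16 (read back from the on-disk cache)
print(len(calls))     # 1
```

A CPAN equivalent would presumably hook into subroutine invocation the same way, replacing the hand-rolled intermediate files I currently pass between scripts.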