On Wednesday, 14 October 2015 at 22:11:56 UTC, data pulverizer
wrote:
On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc
wrote:
https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
Andrei suggested posting more widely.
I am coming at D by way of R, C++, Python etc. so I speak as a
statistician who is interested in data science applications.
Welcome... Looks like we have similar interests.
On the deployment side, D needs to grow its big
data/noSQL infrastructure for a start, then hook into a whole
ecosystem of analytic tools in an easy and straightforward
manner. This will take a lot of work!
Indeed. The dlangscience project managed by John Colvin is very
interesting. It is not a pure stats project, but there will be
many shared areas of need. He has some very interesting ideas, and
being able to mix Python and D in a Jupyter notebook is rather
nice (you can do this already).
I believe it is easier and more effective to start on the
research side. D will need:
1. A data table structure like R's data.frame or data.table.
This is a dynamic data structure that represents a table that
can have lots of operations applied to it. It is the data
structure that separates R from most programming languages. It
is what pandas tries to emulate. This includes text-file and
database I/O, from MySQL and ODBC for a start.
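For what it's worth, the core of such a table fits in a few lines. This is a toy Python sketch of the column-oriented idea only (the class and method names are made up for illustration, not a proposal for a D API):

```python
# Toy sketch of a column-oriented data table, in the spirit of
# R's data.frame. All names here are hypothetical illustrations.
class DataTable:
    def __init__(self, **columns):
        self.columns = {name: list(vals) for name, vals in columns.items()}
        lengths = {len(v) for v in self.columns.values()}
        assert len(lengths) <= 1, "columns must share one length"

    def nrow(self):
        return len(next(iter(self.columns.values()), []))

    def where(self, pred):
        # Row-wise filter, roughly df[pred(df), ] in R.
        keep = [i for i in range(self.nrow())
                if pred({k: v[i] for k, v in self.columns.items()})]
        return DataTable(**{k: [v[i] for i in keep]
                            for k, v in self.columns.items()})

tbl = DataTable(x=[1, 2, 3, 4], y=[10.0, 20.0, 30.0, 40.0])
big = tbl.where(lambda row: row["x"] > 2)
```

The real design question is everything this sketch dodges: heterogeneous column types, views versus copies, and fast grouped operations.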
I fully agree, and have made a very simple start on this. See
github. It's usable for my needs as they stand, although far from
production ready or elegant. You can read and write to/from CSV
and HDF5. I guess MySQL and ODBC wouldn't be hard to add, but I
don't need them myself for now and won't have time to do them. If
I have space I may channel some resources in that direction some
time next year.
2. Formula class : the ability to talk about statistical models
using formulas e.g. y ~ x1 + x2 + x3 etc and then use these
formulas to generate model matrices for input into statistical
algorithms.
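To make the idea concrete, here is a toy Python sketch of the formula-to-model-matrix step. The parser handles only simple `y ~ x1 + x2`-style formulas, and every name here is hypothetical:

```python
def model_matrix(formula, data):
    # Parse a minimal "y ~ x1 + x2" formula (toy sketch only).
    lhs, rhs = [s.strip() for s in formula.split("~")]
    terms = [t.strip() for t in rhs.split("+")]
    n = len(data[lhs])
    # Intercept column plus one column per term, as R's
    # model.matrix() would produce for numeric predictors.
    X = [[1.0] + [float(data[t][i]) for t in terms] for i in range(n)]
    y = [float(v) for v in data[lhs]]
    return y, X

data = {"y": [1, 2], "x1": [3, 4], "x2": [5, 6]}
y, X = model_matrix("y ~ x1 + x2", data)
```

The hard parts a real implementation needs are interactions (`x1:x2`), transformations (`log(x1)`), and expanding factors into contrast columns.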
Sounds interesting. Take a look at Colvin's dlangscience draft
white paper, and see what you would add. It's a chance to shape
things whilst they are still fluid.
3. A solid interface to a big-data database that allows easy
movement between a D data table and the database.
Which ones do you have in mind for stats? The different choices
seem to serve quite different needs. And when you say big data,
how big do you typically mean?
4. Functional programming: especially around data table and
array structures. R's apply(), lapply(), tapply(), plyr and now
data.table(,, by = list()) provides powerful tools for data
manipulation.
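As a point of reference for what these primitives do, R's tapply (apply a function to values within groups) can be sketched in a few lines of Python (names hypothetical):

```python
from collections import defaultdict

def tapply(values, groups, fn):
    # Bucket `values` by the parallel `groups` vector, then apply
    # `fn` per group -- roughly R's tapply(values, groups, fn).
    buckets = defaultdict(list)
    for v, g in zip(values, groups):
        buckets[g].append(v)
    return {g: fn(vs) for g, vs in buckets.items()}

means = tapply([1, 2, 3, 4], ["a", "a", "b", "b"],
               lambda xs: sum(xs) / len(xs))
```

data.table's `by = list()` is essentially this, fused with column selection and done in-place for speed.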
Any thoughts on what the design should look like?
To an extent there is a balance between wanting to explore data
iteratively (when you don't know where you will end up), and
wanting to build a robust process for production. I have been
wondering myself about using LuaJIT to strap together D building
blocks for the exploration (and calling it based on a custom
console built around Adam Ruppe's terminal).
5. A factor data type: for categorical variables. This is easy
to implement! This ties into the creation of model matrices.
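For illustration, a factor and its dummy (treatment-contrast) encoding can be sketched in Python like so (all names are hypothetical, not a design proposal):

```python
def factor(values):
    # A factor stores the distinct sorted levels plus integer
    # codes into that level set, as R's factor() does.
    levels = sorted(set(values))
    codes = [levels.index(v) for v in values]
    return levels, codes

def dummy_columns(values):
    # Treatment-contrast dummies: one 0/1 column per non-baseline
    # level. This is how a factor enters a model matrix.
    levels, codes = factor(values)
    return [[1.0 if c == j else 0.0 for c in codes]
            for j in range(1, len(levels))]

levels, codes = factor(["b", "a", "b"])
```

Storing codes rather than strings is also what makes grouped operations on categorical columns cheap.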
6. Nullable types makes talking about missing data more
straightforward and gives you the opportunity to code them into
a set value in your analysis. D is streets ahead of Python
here, but this is built into R at a basic level.
So matrices with nullable types within? Is NaN enough for you?
If not, it could be quite expensive if the back end is C.
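A small Python sketch of the trade-off being discussed: NaN works as a missing-value marker for floating-point columns, but integer columns need an explicit validity mask or sentinel (this is an illustration, not a design proposal):

```python
import math

# NaN marks missingness in a float column for free.
float_col = [1.0, float("nan"), 3.0]
present = [v for v in float_col if not math.isnan(v)]

# Integers have no NaN, so one option is to pair the data with a
# validity mask -- 0 below is a real value, not "missing".
int_col = [1, 0, 3]
mask = [True, False, True]  # False marks a missing entry
valid = [v for v, ok in zip(int_col, mask) if ok]
```

The mask costs one extra bit (or byte) per element, which is the "quite expensive" part if the backing arrays are handed to a C library that knows nothing about it.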
If D can get points 1, 2, and 3, many people would be all over D,
because it is a fantastic programming language and is wicked
fast.
What do you like best about it ? And in your own domain, what
have the biggest payoffs been in practice?