On Wednesday, 14 October 2015 at 22:11:56 UTC, data pulverizer wrote:
On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
Andrei suggested posting more widely.

I am coming at D by way of R, C++, Python, etc., so I speak as a statistician who is interested in data science applications.

Welcome...  Looks like we have similar interests.

To sit on the deployment side, D needs to grow its big data/NoSQL infrastructure for a start, then hook into a whole ecosystem of analytic tools in an easy and straightforward manner. This will take a lot of work!

Indeed. The dlangscience project managed by John Colvin is very interesting. It is not a pure stats project, but there will be many shared areas of need. He has some very interesting ideas, and being able to mix Python and D in a Jupyter notebook is rather nice (you can do this already).

I believe it is easier and more effective to start on the research side. D will need:

1. A data table structure like R's data.frame or data.table. This is a dynamic data structure that represents a table that can have lots of operations applied to it. It is the data structure that separates R from most programming languages, and it is what pandas tries to emulate. This includes text-file and database I/O, from MySQL and ODBC for a start.

I fully agree, and have made a very simple start on this. See github. It's usable for my needs as they stand, although far from production-ready or elegant. You can read and write to/from CSV and HDF5. I guess MySQL and ODBC wouldn't be hard to add, but I don't need them myself for now and won't have time to do them. If I have space I may channel some resources in that direction some time next year.
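For what it's worth, the core shape of such a type can be sketched in a few lines. The DataTable name and API below are my own invention, and a real data.frame analogue would need mixed column types (std.variant or a tagged union) rather than all-double columns; this only shows the shape:

```d
import std.stdio : writeln;

// Hypothetical minimal sketch: named columns of doubles.
// A real data.frame analogue needs mixed column types
// (std.variant or a tagged union); this only shows the API shape.
struct DataTable
{
    double[][string] columns;   // column name -> values

    size_t nrow() const
    {
        foreach (col; columns)
            return col.length;  // all columns assumed equal length
        return 0;
    }
}

void main()
{
    DataTable dt;
    dt.columns["x"] = [1.0, 2.0, 3.0];
    dt.columns["y"] = [2.0, 4.0, 6.0];
    writeln(dt.nrow);  // 3
}
```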

2. Formula class: the ability to talk about statistical models using formulas, e.g. y ~ x1 + x2 + x3, and then use these formulas to generate model matrices for input into statistical algorithms.

Sounds interesting. Take a look at Colvin's dlang science draft white paper, and see what you would add. It's a chance to shape things whilst they are still fluid.
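As a toy illustration of what such a class might start from, here is a minimal sketch. The Formula and parseFormula names are hypothetical, and a real implementation would also need interactions (x1:x2), intercept handling, and model-matrix generation:

```d
import std.algorithm : map, splitter;
import std.array : array;
import std.string : strip;
import std.stdio : writeln;

// Hypothetical Formula type: a response name plus predictor terms.
struct Formula
{
    string response;
    string[] terms;
}

// Parse a minimal "y ~ x1 + x2 + x3" specification.
// Interactions, intercept control and model matrices are left out.
Formula parseFormula(string spec)
{
    auto sides = spec.splitter("~").map!(s => s.strip).array;
    auto terms = sides[1].splitter("+").map!(s => s.strip).array;
    return Formula(sides[0], terms);
}

void main()
{
    auto f = parseFormula("y ~ x1 + x2 + x3");
    writeln(f.response);  // y
    writeln(f.terms);     // ["x1", "x2", "x3"]
}
```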

3. A solid interface to a big data database that allows moving easily between a D data table and the database.

Which ones do you have in mind for stats? The different choices seem to serve quite different needs. And when you say big data, how big do you typically mean?

4. Functional programming: especially around data table and array structures. R's apply(), lapply(), tapply(), plyr and now data.table(,, by = list()) provide powerful tools for data manipulation.

Any thoughts on what the design should look like?

To an extent there is a balance between wanting to explore data iteratively (when you don't know where you will end up), and wanting to build a robust process for production. I have been wondering myself about using LuaJIT to strap together D building blocks for the exploration (and calling it based on a custom console built around Adam Ruppe's terminal).
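Much of the element-wise machinery already exists in Phobos: std.algorithm's lazy ranges cover a lot of what sapply/Filter/Reduce do in R, e.g.:

```d
import std.algorithm : map, filter, fold;
import std.array : array;
import std.stdio : writeln;

void main()
{
    auto xs = [1.0, 2.0, 3.0, 4.0];

    // sapply-style: apply a function element-wise
    auto doubled = xs.map!(x => x * 2.0).array;

    // Filter then reduce, roughly Reduce(`+`, Filter(f, xs)) in R
    auto total = xs.filter!(x => x > 1.5).fold!((a, b) => a + b)(0.0);

    writeln(doubled);  // [2, 4, 6, 8]
    writeln(total);    // 9
}
```

The grouped-by semantics of tapply() or data.table's by = list() would still need a data-table type to hang off, but the building blocks compose lazily, which matters for large data.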

5. A factor data type: for categorical variables. This is easy to implement! This ties into the creation of model matrices.
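Agreed that this one is easy. A minimal sketch (the Factor name and layout are my own, not an existing library type) just stores the distinct levels plus an integer code per observation, much like R's internal encoding:

```d
import std.algorithm : countUntil;
import std.stdio : writeln;

// Hypothetical Factor type: distinct levels plus an integer
// code per observation, as in R's internal factor encoding.
struct Factor
{
    string[] levels;   // unique categories, in first-seen order
    size_t[] codes;    // index into levels for each observation

    this(string[] values)
    {
        foreach (v; values)
        {
            auto idx = levels.countUntil(v);
            if (idx < 0)
            {
                idx = cast(ptrdiff_t) levels.length;
                levels ~= v;
            }
            codes ~= cast(size_t) idx;
        }
    }
}

void main()
{
    auto f = Factor(["low", "high", "low", "mid"]);
    writeln(f.levels);  // ["low", "high", "mid"]
    writeln(f.codes);   // [0, 1, 0, 2]
}
```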

6. Nullable types: these make talking about missing data more straightforward and give you the opportunity to code missing values into a set value in your analysis. D is streets ahead of Python here, but this is built into R at a basic level.

So matrices with nullable types within? Is NaN enough for you? If not, then it could be quite expensive if the back end is in C.
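D already ships the building block for this: std.typecons.Nullable wraps any type with an explicit missing state, which maps naturally onto NA-style semantics, e.g.:

```d
import std.typecons : Nullable;
import std.stdio : writeln;

void main()
{
    Nullable!double height;            // starts out missing, like R's NA
    assert(height.isNull);

    height = 1.82;                     // an observed value
    assert(!height.isNull);

    // Recode missing data to a chosen sentinel for an analysis step
    double coded = height.isNull ? -999.0 : height.get;
    writeln(coded);  // 1.82
}
```

Whether this beats a plain NaN sentinel for large numeric matrices is the open question above: Nullable carries a flag per value, so a dense NaN-based representation is cheaper when the element type has a spare bit pattern.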

If D can get points 1, 2, and 3, many people would be all over D, because it is a fantastic programming language and is wicked fast.
What do you like best about it? And in your own domain, what have the biggest payoffs been in practice?

