On Tuesday, 17 November 2015 at 13:56:14 UTC, Jay Norwood wrote:
> I looked through the dataframe code and have a couple of comments...
> I had thought perhaps an app could read in the header info and
> type info from HDF5, and generate D struct definitions with the
> column headers as symbol names. That would enable faster
> processing than with the associative arrays, as well as support
> the auto-completion that is helpful when writing expressions.
Yes - I think one will want a choice between this kind of
approach and using associative arrays. For some purposes it's not
convenient to have to compile code every time you open an
unfamiliar file, while on the other hand the lookup cost of an AA
will sometimes matter.
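To make the trade-off concrete, here is a minimal sketch of the two approaches. The names (`Row`, `makeRowStruct`, `rowAsAA`) are illustrative, not from any existing library: an AA row pays a hash lookup per column access, while a struct generated at compile time from the column names gives direct field access and auto-completion.

```d
import std.stdio;

// Associative-array row: flexible at runtime, but each column
// access costs a hash lookup.
double[string] rowAsAA()
{
    return ["open": 1.0, "close": 2.0];
}

// Generate D source for a struct whose fields are the column names.
// Runs at compile time via CTFE when used in a mixin.
string makeRowStruct(string[] names)
{
    string code = "struct Row {";
    foreach (n; names)
        code ~= " double " ~ n ~ ";";
    return code ~ " }";
}

// Compile-time generated struct: field access resolves statically,
// and tooling can auto-complete the column names.
mixin(makeRowStruct(["open", "close"]));

void main()
{
    auto aa = rowAsAA();
    auto r  = Row(1.0, 2.0);
    writeln(aa["open"], " ", r.open);  // same data, different access cost
}
```

In practice the struct variant would be generated from the HDF5 or CSV header by a small code-generation step before compiling the analysis program.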
The situation at the moment is that I have very little time to
work on a correct general solution to this problem myself (yet
it's important for D that we get to one). I also lack the
experience with D to do it very well very quickly. I do have a
couple of seasoned people from the community helping me with
things, but dataframes won't be the first thing they look at, and
it could be a while before we get to that. If we implement
something for our own needs, then I will open-source it, as that
is commercially sensible as well as the right thing to do. But
that could be a year away.
Vlad Levenfeld was also looking at this a bit.
> The CSV type info for columns could be inferred, or else stated
> in the reader call, as Julia offers as an option.
In both cases the column names would have to be valid symbol
names for this to work. I believe Julia also expects this, or
else does some conversion on your column names to make them
valid symbols. I think the D CSV processing would likewise need
to check that the column names are valid D identifiers, and
sanitise them if they are not.
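A sanitiser along these lines could do the job. This is a hypothetical helper, not part of any existing library, and it deliberately ignores the further issue of clashes with D keywords: it replaces any character that cannot appear in a D identifier with an underscore and prepends one if the name starts with a digit.

```d
import std.ascii : isAlphaNum, isDigit;

// Turn an arbitrary CSV column header into a valid D identifier.
// Note: does not check for collisions with D keywords.
string toValidSymbol(string name)
{
    string result;
    foreach (c; name)
        result ~= (isAlphaNum(c) || c == '_') ? c : '_';
    if (result.length == 0 || isDigit(result[0]))
        result = "_" ~ result;
    return result;
}

unittest
{
    assert(toValidSymbol("Adj Close") == "Adj_Close");
    assert(toValidSymbol("2015 high") == "_2015_high");
}
```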
> The Jupyter interactive environment supports Python pandas and
> Julia dataframe column names in its autocompletion, so I think
> the D debugging environment would need to provide a similar
> capability if it is to be considered a fast-recompile
> substitute for interactive dataframe exploration.
Well, we don't need to get there in a single bound - just being
able to do this at all is a big improvement, and I am already
using D with Jupyter to do things.
> It seems to me that your particular examples of stock data
> would eventually need to handle missing data, as supported in
> Julia dataframes and Python pandas. Both provide ways to drop
> or fill missing values. Did you want to support that?
Yes - we should do so eventually, and there's much more that
could be done. But a sensible basic implementation is a start,
and we can refine from there.
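As a sketch of what drop/fill might look like in D, one could mark gaps in a column with `std.typecons.Nullable`. The function names here (`dropMissing`, `fillMissing`) are hypothetical, loosely modelled on pandas' `dropna`/`fillna`, and not from any existing D library.

```d
import std.typecons : Nullable, nullable;
import std.algorithm : filter, map;
import std.array : array;

// Drop missing entries, keeping only the present values.
double[] dropMissing(Nullable!double[] col)
{
    return col.filter!(v => !v.isNull).map!(v => v.get).array;
}

// Replace missing entries with a fill value.
double[] fillMissing(Nullable!double[] col, double fill)
{
    return col.map!(v => v.isNull ? fill : v.get).array;
}

unittest
{
    Nullable!double[] col =
        [nullable(1.0), Nullable!double.init, nullable(3.0)];
    assert(dropMissing(col) == [1.0, 3.0]);
    assert(fillMissing(col, 0.0) == [1.0, 0.0, 3.0]);
}
```

A fuller design would need this per column type, and sentinel encodings (e.g. NaN for floating point) might be cheaper than `Nullable` for large columns.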
I wrote the dataframe in a couple of evenings, so I am sure it
can be improved, and even rearchitected. Pull requests welcome,
and maybe we should set up a Trello board to organise ideas?
Let me know if you are in.