Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

Kevin Squire Wed, 22 Jan 2014 13:06:58 -0800

I'm also a fan of the expression-based interface (mostly because I'm used
to similar things in Pandas).  I haven't looked at that code, though, so I
can't comment on the complexity.


Kevin


On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson <blakejohnso...@gmail.com> wrote:

> Sure, but the resulting expression is *much* more verbose. I just noticed
> that all expression-based indexing was on the chopping block. What is left
> after all this?
>
> I can see how axing these features would make DataFrames.jl easier to
> maintain, but I found the expression stuff to present a rather nice
> interface.
>
> --Blake
>
>
> On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:
>
>> Can you do something like df["ColA"] = f(df)?
>>
>>  — John
>>
>>
>> On Jan 21, 2014, at 8:48 AM, Blake Johnson <blakejo...@gmail.com> wrote:
>>
>> I use within! pretty frequently. What should I be using instead if that
>> is on the chopping block?
>>
>> --Blake
>>
>> On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
>>>
>>> I also agree with your approach, John. Based on your criteria, here
>>> are some other things to consider for the chopping block.
>>>
>>> - expression-based indexing
>>> - NamedArray (you already have an issue on this)
>>> - with, within, based_on and variants
>>> - @transform, @DataFrame
>>> - select, filter
>>> - DataStream
>>>
>>> Many of these were attempts to ease syntax via delayed evaluation. We
>>> can either do without or try to implement something like LINQ.
>>>
>>>
>>>
>>> On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire <kevin....@gmail.com>
>>> wrote:
>>> > Hi John,
>>> >
>>> > I agree with pretty much everything you have written here, and really
>>> > appreciate that you've taken the lead in cleaning things up and getting us
>>> > on track.
>>> >
>>> > Cheers!
>>> >    Kevin
>>> >
>>> >
>>> > On Mon, Jan 20, 2014 at 1:57 PM, John Myles White <johnmyl...@gmail.com>
>>> > wrote:
>>> >>
>>> >> As I said in another thread recently, I am currently the lead maintainer
>>> >> of more packages than I can keep up with. I think it’s been useful for me to
>>> >> start so many different projects, but I can’t keep maintaining most of my
>>> >> packages given my current work schedule.
>>> >>
>>> >> Without Simon Kornblith, Kevin Squire, Sean Garborg and several others
>>> >> doing amazing work to keep DataArrays and DataFrames going, much of our
>>> >> basic data infrastructure would have already become completely unusable. But
>>> >> even with the great work that’s been done on those packages recently, there’s
>>> >> still a lot of additional design work required. I’d like to free up some of my
>>> >> time to do that work.
>>> >>
>>> >> To keep things moving forward, I’d like to propose a couple of radical New
>>> >> Year’s resolutions for the packages I work on.
>>> >>
>>> >> (1) We need to stop adding functionality and focus entirely on improving
>>> >> the quality and documentation of our existing functionality. We have way too
>>> >> much prototype code in DataFrames that I can’t keep up with. I’m about to
>>> >> make a pull request for DataFrames that will remove everything related to
>>> >> column groupings, database-style indexing and Blocks.jl support. I
>>> >> absolutely want to see us push all of those ideas forward in the future, but
>>> >> they need to happen in unmerged forks or separate packages until we have the
>>> >> resources needed to support them. Right now, they make an overwhelming
>>> >> maintenance challenge even more onerous.
>>> >>
>>> >> (2) We can’t support anything other than the master branch of most
>>> >> JuliaStats packages except possibly for Distributions. I personally don’t
>>> >> have the time to simultaneously keep stuff working with Julia 0.2 and Julia
>>> >> 0.3. Moreover, many of our basic packages aren’t mature enough to justify
>>> >> supporting older versions. We should do a better job of supporting our
>>> >> master releases and not invest precious time trying to support older
>>> >> releases.
>>> >>
>>> >> (3) We need to make more of DataArrays and DataFrames reflect the Julian
>>> >> worldview. Lots of our code uses an interface that is incongruous with the
>>> >> interfaces found in Base. Even worse, a large chunk of code has
>>> >> type-stability problems that make it very slow, when comparable code that
>>> >> uses normal Arrays is 100x faster. We need to develop new idioms and new
>>> >> strategies for making code that interacts with type-destabilizing NA’s
>>> >> faster. More generally, we need to make DataArrays and DataFrames fit in
>>> >> better with Julia when Julia and R disagree. Following R’s lead has often
>>> >> led us astray because R doesn’t share Julia’s strengths or weaknesses.
>>> >>
>>> >> (4) Going forward, there should be exactly one way to do most things. The
>>> >> worst part of our current codebase is that there are multiple ways to
>>> >> express the same computation, but (a) some of them are unusably slow and (b)
>>> >> some of them don’t ever get tested or maintained properly. This is closely
>>> >> linked to the excess proliferation of functionality described in Resolution
>>> >> 1 above. We need to start removing stuff from our packages and making the
>>> >> parts we keep both reliable and fast.
>>> >>
>>> >> I think we can push DataArrays and DataFrames to 1.0 status by the end of
>>> >> this year. But I think we need to adopt a new approach if we’re going to get
>>> >> there. Lots of stuff needs to get deprecated and what remains needs a lot
>>> >> more testing, benchmarking and documentation.
>>> >>
>>> >>  — John
>>> >>
>>> >
>>
>>
>>
