Re: [julia-users] New Year's resolutions for DataArrays, DataFrames and other packages

Kevin Squire Wed, 22 Jan 2014 14:50:05 -0800

Maybe I misinterpreted the term "expression-based interface".


On Wed, Jan 22, 2014 at 2:33 PM, John Myles White
<[email protected]> wrote:

> My impression is that Pandas didn't support anything like delayed
> evaluation. Is that wrong?
>
> I'm aware that the resulting expressions are a lot more verbose. That
> definitely sucks.
>
> I'd love to see strong proposals for how we're going to do a better job of
> making code shorter going forward. But too much of our current codebase is
> buggy, unable to handle edge cases, slow and undocumented. I think it's
> much more important that we have one way of doing things that actually
> works as advertised for every Julia user than two ways of doing things,
> each of which is slightly broken and performs worse than R and Pandas.
>
> As I've been saying lately, I'm burning out on maintaining so much Julia
> code. If someone else wants to take charge of my projects, I'm ok with
> that. But if I'm going to be doing the work going forward, I need to devote
> my energies to making a small number of things work really well. Once we
> get our core functionality solid, I'll be comfortable getting fancier stuff
> working again.
>
>  -- John
>
> On Jan 22, 2014, at 1:06 PM, Kevin Squire <[email protected]> wrote:
>
> I'm also a fan of the expression-based interface (mostly because I'm used
> to similar things in Pandas).  I haven't looked at that code, though, so I
> can't comment on the complexity.
>
> Kevin
>
>
> On Wed, Jan 22, 2014 at 11:18 AM, Blake Johnson 
> <[email protected]> wrote:
>
>> Sure, but the resulting expression is *much* more verbose. I just
>> noticed that all expression-based indexing was on the chopping block. What
>> is left after all this?
>>
>> I can see how axing these features would make DataFrames.jl easier to
>> maintain, but I found the expression stuff to present a rather nice
>> interface.
>>
>> --Blake
>>
>>
>> On Tuesday, January 21, 2014 11:51:03 AM UTC-5, John Myles White wrote:
>>
>>> Can you do something like df["ColA"] = f(df)?
>>>
>>>  — John
>>>
>>>
>>> On Jan 21, 2014, at 8:48 AM, Blake Johnson <[email protected]> wrote:
>>>
>>> I use within! pretty frequently. What should I be using instead if that
>>> is on the chopping block?
>>>
>>> --Blake
>>>
>>> On Tuesday, January 21, 2014 7:42:39 AM UTC-5, tshort wrote:
>>>>
>>>> I also agree with your approach, John. Based on your criteria, here
>>>> are some other things to consider for the chopping block.
>>>>
>>>> - expression-based indexing
>>>> - NamedArray (you already have an issue on this)
>>>> - with, within, based_on and variants
>>>> - @transform, @DataFrame
>>>> - select, filter
>>>> - DataStream
>>>>
>>>> Many of these were attempts to ease syntax via delayed evaluation. We
>>>> can either do without or try to implement something like LINQ.
>>>>
>>>>
>>>>
>>>> On Mon, Jan 20, 2014 at 7:02 PM, Kevin Squire <[email protected]>
>>>> wrote:
>>>> > Hi John,
>>>> >
>>>> > I agree with pretty much everything you have written here, and really
>>>> > appreciate that you've taken the lead in cleaning things up and getting us
>>>> > on track.
>>>> >
>>>> > Cheers!
>>>> >    Kevin
>>>> >
>>>> >
>>>> > On Mon, Jan 20, 2014 at 1:57 PM, John Myles White <johnmyl...@gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> As I said in another thread recently, I am currently the lead maintainer
>>>> >> of more packages than I can keep up with. I think it’s been useful for me to
>>>> >> start so many different projects, but I can’t keep maintaining most of my
>>>> >> packages given my current work schedule.
>>>> >>
>>>> >> Without Simon Kornblith, Kevin Squire, Sean Garborg and several others
>>>> >> doing amazing work to keep DataArrays and DataFrames going, much of our
>>>> >> basic data infrastructure would have already become completely unusable. But
>>>> >> even with the great work that’s been done on those packages recently, there’s
>>>> >> still a lot of additional design work required. I’d like to free up some of my
>>>> >> time to do that work.
>>>> >>
>>>> >> To keep things moving forward, I’d like to propose a couple of radical New
>>>> >> Year’s resolutions for the packages I work on.
>>>> >>
>>>> >> (1) We need to stop adding functionality and focus entirely on improving
>>>> >> the quality and documentation of our existing functionality. We have way too
>>>> >> much prototype code in DataFrames that I can’t keep up with. I’m about to
>>>> >> make a pull request for DataFrames that will remove everything related to
>>>> >> column groupings, database-style indexing and Blocks.jl support. I
>>>> >> absolutely want to see us push all of those ideas forward in the future, but
>>>> >> they need to happen in unmerged forks or separate packages until we have the
>>>> >> resources needed to support them. Right now, they make an overwhelming
>>>> >> maintenance challenge even more onerous.
>>>> >>
>>>> >> (2) We can’t support anything other than the master branch of most
>>>> >> JuliaStats packages except possibly for Distributions. I personally don’t
>>>> >> have the time to simultaneously keep stuff working with Julia 0.2 and Julia
>>>> >> 0.3. Moreover, many of our basic packages aren’t mature enough to justify
>>>> >> supporting older versions. We should do a better job of supporting our
>>>> >> master releases and not invest precious time trying to support older
>>>> >> releases.
>>>> >>
>>>> >> (3) We need to make more of DataArrays and DataFrames reflect the Julian
>>>> >> worldview. Lots of our code uses an interface that is incongruous with the
>>>> >> interfaces found in Base. Even worse, a large chunk of code has
>>>> >> type-stability problems that make it very slow, when comparable code that
>>>> >> uses normal Arrays is 100x faster. We need to develop new idioms and new
>>>> >> strategies for making code that interacts with type-destabilizing NA’s
>>>> >> faster. More generally, we need to make DataArrays and DataFrames fit in
>>>> >> better with Julia when Julia and R disagree. Following R’s lead has often
>>>> >> led us astray because R doesn’t share Julia’s strengths or weaknesses.
>>>> >>
>>>> >> (4) Going forward, there should be exactly one way to do most things. The
>>>> >> worst part of our current codebase is that there are multiple ways to
>>>> >> express the same computation, but (a) some of them are unusably slow and (b)
>>>> >> some of them don’t ever get tested or maintained properly. This is closely
>>>> >> linked to the excess proliferation of functionality described in Resolution
>>>> >> 1 above. We need to start removing stuff from our packages and making the
>>>> >> parts we keep both reliable and fast.
>>>> >>
>>>> >> I think we can push DataArrays and DataFrames to 1.0 status by the end of
>>>> >> this year. But I think we need to adopt a new approach if we’re going to get
>>>> >> there. Lots of stuff needs to get deprecated and what remains needs a lot
>>>> >> more testing, benchmarking and documentation.
>>>> >>
>>>> >>  — John
>>>> >>
>>>> >
>>>
>>>
>>>
>
>
