On Sun, Nov 21, 2010 at 6:02 PM, <josef.p...@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 5:09 PM, Keith Goodman <kwgood...@gmail.com> wrote:
>> On Sun, Nov 21, 2010 at 12:30 PM, <josef.p...@gmail.com> wrote:
>>> On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman <kwgood...@gmail.com> wrote:
>>>> On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>> On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman <kwgood...@gmail.com>
>>>>> wrote:
>>>>>> On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney <wesmck...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Keith (and others),
>>>>>>>
>>>>>>> What would you think about creating a library of mostly Cython-based
>>>>>>> "domain specific functions"? So stuff like rolling statistical
>>>>>>> moments, nan* functions like you have here, and all that-- NumPy-array
>>>>>>> only functions that don't necessarily belong in NumPy or SciPy (but
>>>>>>> could be included on down the road). You were already talking about
>>>>>>> this on the statsmodels mailing list for larry. I spent a lot of time
>>>>>>> writing a bunch of these for pandas over the last couple of years, and
>>>>>>> I would have relatively few qualms about moving these outside of
>>>>>>> pandas and introducing a dependency. You could do the same for larry--
>>>>>>> then we'd all be relying on the same well-vetted and tested codebase.
>>>>>>
>>>>>> I've started working on moving window statistics cython functions. I
>>>>>> plan to make it into a package called Roly (for rolling). The
>>>>>> signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
>>>>>> window, axis=-1), etc.
>>>>>>
>>>>>> I think of Nanny and Roly as two separate packages. A narrow focus is
>>>>>> good for a new package. But maybe each package could be a subpackage
>>>>>> in a super package?
>>>>>>
>>>>>> Would the function signatures in Nanny (exact duplicates of the
>>>>>> corresponding functions in Numpy and Scipy) work for pandas? I plan to
>>>>>> use Nanny in larry.
>>>>>> I'll try to get the structure of the Nanny package
>>>>>> in place. But if it doesn't attract any interest after that then I may
>>>>>> fold it into larry.
>>>>>> _______________________________________________
>>>>>> NumPy-Discussion mailing list
>>>>>> NumPy-Discussion@scipy.org
>>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>>
>>>>>
>>>>> Why make multiple packages? It seems like all these functions are
>>>>> somewhat related: practical tools for real-world data analysis (where
>>>>> observations are often missing). I suspect having everything under one
>>>>> hood would create more interest than chopping things up-- would be
>>>>> very useful to folks in many different disciplines (finance,
>>>>> economics, statistics, etc.). In R, for example, NA-handling is just a
>>>>> part of every day life. Of course in R there is a special NA value
>>>>> which is distinct from NaN-- many folks object to the use of NaN for
>>>>> missing values. The alternative is masked arrays, but in my case I
>>>>> wasn't willing to sacrifice so much performance for purity's sake.
>>>>>
>>>>> I could certainly use the nan* functions to replace code in pandas
>>>>> where I've handled things in a somewhat ad hoc way.
>>>>
>>>> A package focused on NaN-aware functions sounds like a good idea. I
>>>> think a good plan would be to start by making faster, drop-in
>>>> replacements for the NaN functions that are already in numpy and
>>>> scipy. That is already a lot of work. After that, one possibility is
>>>> to add stuff like nancumsum, nanprod, etc. After that moving window
>>>> stuff?
>>>
>>> and maybe group functions after that?
>>
>> Yes, group functions are on my list.
>>
>>> If there is a lot of repetition, you could use templating. Even simple
>>> string substitution, if it is only replacing the dtype, works pretty
>>> well. It would at least reduce some copy-paste.
>>
>> Unit test coverage should be good enough to mess around with trying
>> templating.
>> What's a good way to go? Write my own script that creates
>> the .pyx file and call it from the make file? Or are there packages
>> for doing the templating?
>
> Depends on the scale, I tried once with simple string templates
> http://codespeak.net/pipermail/cython-dev/2009-August/006614.html
>
> here is a pastebin of another version by ....(?),
> http://pastebin.com/f1a49143d discussed on the cython-dev mailing
> list.
>
> The cython list has the discussion every once in a while but I haven't
> seen any conclusion yet. For heavier duty templating a proper
> templating package (Jinja?) might be better.
>
> I'm not an expert.
>
> Josef
>
>>
>> I added nanmean (the first scipy function to enter nanny) and nanmin.
What would you say to a single package that contains:

- NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)
- moving window functions (moving_{count, sum, mean, var, std, etc.})
- core subroutines for labeled data
- group-by functions
- other things to add to this list?

In other words, basic computational building blocks for making libraries
like larry, pandas, etc. and doing time series / statistical / other
manipulations on real world (messy) data sets. The focus isn't so much
"NaN-awareness" per se but more practical "data wrangling". I would be
happy to work on such a package and to move all the Cython code I've
written into it. There's potentially a little bit of datarray overlap,
but I think that's OK.
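To make the "moving window" and "NaN-aware" items concrete, here is a rough
pure-NumPy sketch of what a function with the mov_nansum(arr, window, axis=-1)
signature quoted above might compute. This is only an illustration, not the
Roly or pandas implementation: the cumulative-sum approach, the choice to
treat NaN as zero, and the choice to fill the first window - 1 positions with
NaN are all assumptions made for this example. A Cython version would
presumably loop directly instead.

```python
import numpy as np


def mov_nansum(arr, window, axis=-1):
    """Moving-window sum along `axis`, treating NaN as zero.

    Illustrative reference only. Positions where the window is not yet
    full are set to NaN (an assumed convention, not taken from the thread).
    """
    arr = np.asarray(arr, dtype=np.float64)
    n = arr.shape[axis]
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the axis length")

    # Put the target axis last so the slicing below stays simple.
    a = np.moveaxis(arr, axis, -1)

    # Replace NaN with 0 so missing values do not poison the sum.
    filled = np.where(np.isnan(a), 0.0, a)

    # Cumulative sum with a leading zero, so that the sum over
    # positions i-window+1 .. i equals csum[i + 1] - csum[i + 1 - window].
    zeros = np.zeros(a.shape[:-1] + (1,))
    csum = np.concatenate([zeros, np.cumsum(filled, axis=-1)], axis=-1)

    out = np.full(a.shape, np.nan)
    out[..., window - 1:] = csum[..., window:] - csum[..., :-window]

    # Restore the original axis order.
    return np.moveaxis(out, -1, axis)


x = np.array([1.0, np.nan, 3.0, 4.0])
print(mov_nansum(x, window=2))
```

Note that with this convention a window containing only NaN sums to 0, and
that differencing a cumulative sum can lose precision on long, large-valued
inputs, which is one reason a direct loop may be preferable in practice.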