On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer <sho...@gmail.com> wrote:
> pandas has some hacks to support custom types of data that numpy can't
> handle well enough or at all. Examples include datetime and Categorical [1],
> and others like GeoArray [2] that haven't made it into pandas yet.
>
> Most of these look like numpy arrays but with custom dtypes and
> type-specific methods/properties. But clearly nobody is particularly
> excited about writing the C necessary to implement custom dtypes [3].
> Nor do we need the ndarray ABI.
>
> In many cases, writing C may not actually be necessary for performance
> reasons; e.g., categorical can be fast enough just by wrapping an integer
> ndarray for the internal storage and using vectorized operations. And even
> if it is necessary, I think we'd all rather write Cython than C.
>
> It's great for pandas to write its own ndarray-like wrappers (*not*
> subclasses) that work with pandas, but it's a shame that there isn't a
> standard interface like the ndarray to make these arrays usable for the
> rest of the scientific Python ecosystem. For example, pandas has loads of
> fixes for np.datetime64, but nobody seems to be up for porting them to
> numpy (I doubt it would be easy).
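For concreteness, here is a rough sketch of the categorical-via-integer-codes idea mentioned above -- a plain Python wrapper around an integer ndarray, using vectorized numpy operations. The class name and methods are illustrative only, not pandas' actual Categorical implementation:

```python
import numpy as np

class Categorical:
    """Minimal sketch: a categorical array backed by integer codes.

    Illustration of the idea only -- not pandas' real implementation.
    """

    def __init__(self, values, categories=None):
        if categories is None:
            categories = sorted(set(values))
        self.categories = np.asarray(categories, dtype=object)
        lookup = {c: i for i, c in enumerate(categories)}
        # Internal storage is a plain integer ndarray of codes.
        self.codes = np.array([lookup[v] for v in values], dtype=np.intp)

    def __len__(self):
        return len(self.codes)

    def __getitem__(self, i):
        # Decode: map code(s) back to category labels.
        return self.categories[self.codes[i]]

    def value_counts(self):
        # Vectorized counting over the integer codes with np.bincount.
        counts = np.bincount(self.codes, minlength=len(self.categories))
        return dict(zip(self.categories, counts))
```

Operations like counting, grouping, or comparison then run on the integer codes at ndarray speed, without any new C code.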
Writing them in the first place probably wasn't easy either :-). I don't
really know why pandas spends so much effort on reimplementing stuff and
papering over numpy limitations instead of fixing things upstream so that
everyone can benefit. I assume they have reasons, and I could make some
general guesses at what some of them might be, but if you want to know what
they are -- which is presumably the first step in changing the situation --
you'll have to ask them, not us :-).

> I know these sorts of concerns are not new, but I wish I had a sense of
> what the solution looks like. Is anyone actively working on these issues?
> Does the fix belong in numpy, pandas, blaze or a new project? I'd love to
> get a sense of where things stand and how I could help -- without writing
> any C :).

I think there are three parts:

For stuff that's literally just fixing bugs in things numpy already has,
we'd certainly be happy to accept those bug fixes. There are probably things
we can do to make this easier, I dunno. I'd love to see some of numpy's
internals moving into Cython to make them easier to hack on, but this won't
be simple, because right now using Cython to implement a module is an
all-or-nothing affair; making it possible to mix Cython with numpy's
existing C code will require upstream changes in Cython.

For cases where people genuinely want to implement a new array-like type
(e.g. DataFrame or scipy.sparse), numpy already provides a fair amount of
support (e.g., the various hooks that allow things like np.asarray(mydf) or
np.sin(mydf) to work), and we're working on adding more over time (e.g.,
__numpy_ufunc__). My feeling, though, is that in most of the cases you
mention, implementing a new array-like type is huge overkill: ndarray's
interface is vast, and reimplementing even 90% of it is a huge effort.
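To illustrate one of the existing hooks mentioned above: defining __array__ on a custom container is enough for np.asarray() and ufuncs like np.sin() to consume it. The class and attribute names here are made up for the example:

```python
import numpy as np

class MyArrayLike:
    """Toy container demonstrating the __array__ interoperability hook."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None):
        # numpy calls this to convert the object into an ndarray,
        # both for np.asarray() and when a ufunc needs an array input.
        return np.asarray(self._data, dtype=dtype)

a = MyArrayLike([0.0, np.pi / 2])
np.asarray(a)  # converted via __array__
np.sin(a)      # ufunc coerces through __array__ as well
```

Hooks like __numpy_ufunc__ aim to go further, letting the custom type intercept the ufunc call itself instead of being coerced to a plain ndarray.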
For most of the cases that people seem to run into in practice, the
solution is to enhance numpy's dtype interface so that it's possible for
mere mortals to implement new dtypes, e.g. by just subclassing np.dtype.
This is totally doable and would enable a ton of awesomeness, but it
requires someone with the time to sit down and work on it, and no one has
volunteered yet. Unfortunately, it does require hacking on C code.

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion