On Tue, Jul 6, 2010 at 12:56 PM, Keith Goodman <kwgood...@gmail.com> wrote:
> On Tue, Jul 6, 2010 at 9:52 AM, Joshua Holbrook <josh.holbr...@gmail.com>
> wrote:
>> On Tue, Jul 6, 2010 at 8:42 AM, Skipper Seabold <jsseab...@gmail.com> wrote:
>>> On Tue, Jul 6, 2010 at 12:36 PM, Joshua Holbrook
>>> <josh.holbr...@gmail.com> wrote:
>>>> I'm kinda-sorta still getting around to building/reading the sphinx
>>>> docs for datarray. <_< Like, I've gone through them before, but it was
>>>> more cursory than I'd like. Honestly, I kinda let myself get caught up
>>>> in trying to automate the process of getting them onto github pages.
>>>>
>>>> I have to admit that I didn't 100% understand the reasoning behind not
>>>> allowing integer ticks (I blame jet lag--it's a nice scapegoat). I
>>>> believe it originally had to do with what you meant if you typed, say,
>>>> A[3:"london"]; Did you mean the underlying ndarray index 3, or the
>>>> outer level "tick" 3? I think if you didn't allow integers, then you
>>>> could simply wrap your "3" in a string: A["3":"London"] so it's
>>>> probably not a deal-breaker, but I would imagine that using (a)
>>>> separate method(s) for label-based indexing may make allowing
>>>> integer-datatyped labels.
>>>>
>>>> Thoughts?
>>>
>>> Would you mind bottom-posting/ posting in-line to make the thread
>>> easier to follow?
>>>
>>>>
>>>> --Josh
>>>>
>>>> On Tue, Jul 6, 2010 at 8:23 AM, Keith Goodman <kwgood...@gmail.com> wrote:
>>>>> On Tue, Jul 6, 2010 at 9:13 AM, Skipper Seabold <jsseab...@gmail.com>
>>>>> wrote:
>>>>>> On Tue, Jul 6, 2010 at 11:55 AM, Keith Goodman <kwgood...@gmail.com>
>>>>>> wrote:
>>>>>>> On Tue, Jul 6, 2010 at 7:47 AM, Joshua Holbrook
>>>>>>> <josh.holbr...@gmail.com> wrote:
>>>>>>>> I really really really want to work on this. I already forked datarray
>>>>>>>> on github and did some research on What Other People Have Done (
>>>>>>>> http://jesusabdullah.github.com/2010/07/02/datarray.html ). With any
>>>>>>>> luck I'll contribute something actually useful. :)
>>>>>>>
>>>>>>> I like the figure!
>>>>>>>
>>>>>>> To do label indexing on a larry you need to use lix, so lar.lix[...]
>>>>>>
>>>>>> FYI, if you didn't see it, there are also usage docs in dataarray/doc
>>>>>> that you can build with sphinx that show a lot of the thinking and
>>>>>> examples (they spent time looking at pandas and larry).
>>>>>>
>>>>>> One question that was asked of Wes, that I'd propose to you as well
>>>>>> Keith, is that if DataArray became part of NumPy, do you think you
>>>>>> could use it to work on top of for larry?
>>>>>
>>>>> This is all very exciting. I did not know that DataArray had ticks so
>>>>> I never took a close look at it.
>>>>>
>>>>> After reading the sphinx doc, one question I had was how firm is the
>>>>> decision to not allow integer ticks? I use int ticks a lot.
>>>
>>> I think what Josh said is right. However, we proposed having all of
>>> the new labeled axis access pushed to a .aix (or whatever) method, so
>>> as to avoid any confusion, as the original object can be accessed just
>>> as an ndarray. I'm not sure where this leaves us vis-a-vis ints as
>>> ticks.
>>>
>>> Skipper
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>
>> Sorry re: posting at-top. I guess habit surpassed observation of
>> community norms for a second there. Whups!
>>
>> My opinion on the matter is that, as a matter of "purity," labels
>> should all have the string datatype. That said, I'd imagine that
>> passing an int as an argument would be fine, due to python's
>> loosey-goosey attitude towards datatypes. :) That, or, y'know,
>> str(myint).
>
> Ideally (for me), the only requirement for ticks would be hashable and
> unique along any one axis. So, for example, datetime.date() could be a
> tick but a list could not be a tick (not hashable).
>
Gmail really needs to get its act together and enable bottom-posting by
default. Definitely an annoyance.

There are many issues at play here, so I wanted to give some of my
thoughts re: building pandas, larry, etc. on top of DataArray (or
whatever it is that makes its way into NumPy). I can put this on the
wiki, too:

1. Giving semantic information to axes (not ticks, though)

I think this is very useful but wouldn't be immediately useful in pandas
except perhaps for moving axis names elsewhere (they are currently a part
of the data structures and always have the same name). I wouldn't be
immediately comfortable with, say, making a pandas DataFrame a subclass
of DataArray and making them implicitly interoperable. Going back and
forth between DataArray and DataFrame *should* be an easy operation--
you could imagine using DataArray to serialize both pandas and larry
objects, for example!

2. Container for axis metadata (Axis object in datarray, Index in
pandas, ...)

I would be more than happy to offload the "ordered set" data structure
onto NumPy. In pandas, Index is that container-- it's an ndarray
subclass with a handful of methods and a reverse index (e.g. if you have
['d', 'b', 'a', 'c'] you have a dict somewhere with {'d' : 0, 'b' : 1,
...} for O(1) lookups). I'm producing the reverse index in Cython at
object creation time-- Keith recently added the same thing (Cython) to
larry to get a speed boost, but he does it only when needed. It's also
nice to have some other convenience methods in this object, like set
operations. In pandas, there is also the DateRange class (a subclass of
Index, so recognized as valid by the data structures), which has a
sequence of Python datetime objects and frequency information. IMHO this
should all go inside NumPy and leverage the datetime64 dtype. With date
ranges you can also special-case set operations (e.g. union or
intersection) when the ranges overlap (in practice this can yield a huge
performance boost)!
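
To make the reverse-index idea concrete, here is a minimal pure-Python
sketch. SimpleIndex and its method names are hypothetical, chosen only for
illustration; this is not the pandas Index or datarray Axis API, and it
stores labels in a tuple rather than an ndarray subclass. It builds the
label-to-position dict eagerly at creation time and adds one
order-preserving set operation:

```python
class SimpleIndex:
    """Hypothetical ordered-set container: labels plus a reverse index."""

    def __init__(self, labels):
        self.labels = tuple(labels)
        # Reverse index: label -> integer position, built once up front.
        # Labels must be hashable (a list label raises TypeError here).
        self._positions = {label: i for i, label in enumerate(self.labels)}
        if len(self._positions) != len(self.labels):
            raise ValueError("labels must be unique along an axis")

    def position(self, label):
        # Average-case O(1) dict lookup instead of scanning the labels.
        return self._positions[label]

    def intersection(self, other):
        # Keep labels found in both indexes, in the order of `self`.
        return SimpleIndex(x for x in self.labels if x in other._positions)


index = SimpleIndex(['d', 'b', 'a', 'c'])
pos_b = index.position('b')
common = index.intersection(SimpleIndex(['c', 'd', 'z']))
```

A lazy variant (closer to what is described for larry) would defer
building the dict until the first lookup.
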
I like using ndarray for the ticks because slicing produces views, etc.
(but in the current implementation in pandas, slicing requires
constructing a new reverse index from scratch). As for the acceptable
type for ticks-- I am with Keith in requiring only hashability. So, to
support integer ticks for completeness, DataArray probably needs a
separate "access by tick" interface (already mentioned above, I
believe). I saw criticism in the datarray docs about pandas having
ambiguous behavior for integer ticks-- my view is that you have ticks so
you don't have to think about "where" things are in the data structure
;) But again, datarray is a different story-- ticks not required!

3. Data alignment routines

I think the fundamental data alignment routines in larry and pandas
belong in NumPy. We're both creating an integer vector in Cython and
passing that to ndarray.take. There is also the issue of missing data
handling. We should spend a little time and decide on an API for these
functions that will work for both libraries, and probably write C
implementations. Here's the Cython code I'm referring to (which isn't
all that pretty, and makes assumptions guaranteed by other parts of
pandas):

http://code.google.com/p/pandas/source/browse/trunk/pandas/lib/src/reindex.pyx

4. Group-by routines

Not necessarily related to DataArray but highly relevant to statistical
data structures (Skipper made a comment about this at the BoF). Having
core group-by routines makes a lot of sense rather than having all of us
implement our own things. See Travis's NEP:

http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst

(it is not rendering correctly for me, so download the RST).

Group-by basically comes down to solving two problems: assigning chunks
of data to groups (using some kind of mapping or function), and doing
something with those group assignments (like aggregating or
transforming-- think group means, or standardizing / z-scoring within
group).
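
The two steps can be sketched in a few lines of plain Python. This is
only an illustration under my own assumptions, not the pandas or larry
implementation and not the API proposed in the NEP: the function names
(assign_groups, aggregate, transform, demean) are hypothetical, and
demeaning stands in for z-scoring to keep the example short.

```python
from collections import defaultdict
from statistics import mean


def assign_groups(values, key_func):
    """Step 1: map each group key to the positions of its members."""
    groups = defaultdict(list)
    for position, value in enumerate(values):
        groups[key_func(value)].append(position)
    return dict(groups)


def aggregate(values, groups, func):
    """Step 2a: reduce each group to a single value."""
    return {key: func([values[i] for i in positions])
            for key, positions in groups.items()}


def transform(values, groups, func):
    """Step 2b: compute a same-length result within each group."""
    result = [None] * len(values)
    for positions in groups.values():
        members = [values[i] for i in positions]
        for i, new_value in zip(positions, func(members)):
            result[i] = new_value
    return result


def demean(members):
    # Subtract the group mean from every member of the group.
    center = mean(members)
    return [x - center for x in members]


data = [1.0, 2.0, 3.0, 10.0, 20.0]
groups = assign_groups(data, lambda x: "small" if x < 5 else "large")
group_means = aggregate(data, groups, mean)
demeaned = transform(data, groups, demean)
```

Here the group assignments live in a dict keyed by an arbitrary function's
output, which is the fully general (and slower) case.
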
Using Python dicts to store the group assignments computed by arbitrary
functions (the way pandas does it now) is often suboptimal if you want
to, say, group one ndarray by another-- I think in most cases we can do
a lot better, but it will be important to have a very "general" group-by
where performance might be a little slower.

----

In any case-- if we can trim down the amount of duplicated logic between
the various libraries, I think that would be a big win overall. I'm not
sure if having "one data object to rule them all" is something we can
achieve for the moment. pandas has been developed decidedly for
statistics, econometrics, and finance, which has led to some slightly
domain-specific design choices. I am fairly certain there are a large
number of users out there for whom these sorts of tools could be hugely
useful in making the switch to Python from R, Matlab, Java, C++, etc.

- Wes
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion