On Thu, Jul 8, 2010 at 1:38 PM, Lluís <xscr...@gmx.net> wrote:
> Skipper Seabold writes:
>
>> On Thu, Jul 8, 2010 at 12:02 PM, Rob Speer <rsp...@mit.edu> wrote:
> [...]
>>> My proposal is that datarray.row should be equivalent to
>>> datarray.axes[0], and datarray.column should be equivalent to
>>> datarray.axes[1], so that you can always ask for something like
>>> "arr.column.named(2010)" (replace those with square brackets if you
>>> like).
>>>
>>> Not sure yet what the right way is to generalize this to 1-D and n-D.
>
>> I think we have to start from the nD case, even if I (and I think most
>> users) will tend to think in 2D. The rest is just going to have to be
>> up to developers how they want users to interact with what we, the
>> developers, see as axes. No end-user wants to think about the 6th
>> axis of the data, but I don't want to be pegged into rows and columns
>> thinking because I don't think it works for the below example.
>
> You could simply provide a subclass of datarray called 'table' that
> automatically labels the two (mandatory) axis as 'column' and 'row'.
>
>
> [...]
>> city, month, year, region, precipitation, temperature
>> "Austin", "January", 1980, "South", 12.1, 65.4,
>> "Austin", "February", 1980, "South", 24.3, 55.4
>> "Austin", "March", 1980, "South", 3, 69.1
>> ....
>> "Austin", "December", 2009, 1, 62.1
>> "Boston", "January", 1980, "Northeast", 1.5, 19.2
>> ....
>> "Boston","December", 2009, "Northeast", 2.1, 23.5
>> ...
>> "Memphis","January",1980, "South", 2.1, 35.6
>> ...
>> "Memphis","December",2009, "South", 1.2, 33.5
>> ...
>
>> Sometimes, I want, say, to know what the average temperature is in
>> December. Sometimes I want to know what the average temperature is in
>> Memphis. Sometimes I want to know the average temperature in Memphis
>> in December or in Memphis in 1985. If I do this with structured
>> arrays, most group-by type operations are at best O(n). Really this
>> isn't feasible.
>
> If I understood well, you could have 4 axes (assuming that an Axis can only
> handle a single label/variable).
>
>   a = DatArray(numpy.array([...], dtype = [("precipitation", float),
>                                            ("temperature", float)]),
>                (("city", ["Austin", ...]),
>                 ("month", ["January"]),
>                 ...))
>
> Then, you can:
>   a.city.named("Memphis").month.named("December")["temperature"].mean()
>   a.city.named("Memphis").year.named(1985)["temperature"].mean()
>
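To make the one-axis-per-variable layout concrete, here is a minimal sketch in plain NumPy. It is not the datarray API: the LabeledArray class and its select() method are hypothetical names made up for illustration, and I use only three axes (city, month, year) to keep it short. The numbers are a few of the rows from the table quoted above; every other cell is padded with NaN.

```python
import numpy as np

# Hypothetical stand-in for the proposed DatArray; NOT the datarray API.
class LabeledArray:
    def __init__(self, data, axes):
        # axes: sequence of (axis_name, labels) pairs, one per dimension
        self.data = np.asarray(data)
        self.names = [name for name, _ in axes]
        # One dict per axis mapping label -> integer position
        self.lookup = [{lab: i for i, lab in enumerate(labels)}
                       for _, labels in axes]

    def select(self, **labels):
        # Integer position where a label is given, full slice elsewhere
        index = tuple(
            self.lookup[k][labels[name]] if name in labels else slice(None)
            for k, name in enumerate(self.names)
        )
        return self.data[index]

cities = ["Austin", "Boston", "Memphis"]
months = ["January", "December"]
years = [1980, 2009]

dtype = [("precipitation", float), ("temperature", float)]
records = np.zeros((len(cities), len(months), len(years)), dtype=dtype)
records["precipitation"] = np.nan   # cells without a measurement stay NaN
records["temperature"] = np.nan

# (city, month, year, precipitation, temperature)
observations = [
    ("Austin", "January", 1980, 12.1, 65.4),
    ("Boston", "January", 1980, 1.5, 19.2),
    ("Boston", "December", 2009, 2.1, 23.5),
    ("Memphis", "January", 1980, 2.1, 35.6),
    ("Memphis", "December", 2009, 1.2, 33.5),
]
for city, month, year, precip, temp in observations:
    pos = (cities.index(city), months.index(month), years.index(year))
    records[pos] = (precip, temp)

arr = LabeledArray(records, [("city", cities), ("month", months), ("year", years)])

# NaN-aware means, since most cells are padding
memphis_dec = np.nanmean(arr.select(city="Memphis", month="December")["temperature"])
december_all = np.nanmean(arr.select(month="December")["temperature"])
```

Selecting by label costs one dict lookup per axis plus ordinary NumPy indexing, at the price of storing a cell for every (city, month, year) combination whether or not it was measured.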
One question at this point is whether attribute access like
a.city.named(...) has to be coded in Python, as recarrays currently do. If
so, what is the speed trade-off?

> Or shorter:
>   a.named["Memphis","December"]["temperature"].mean()
>   a.named["Memphis",:,"1985"]["temperature"].mean()
>

I much prefer the shorter form. I also prefer "by" to "named", but that is
for later. That is, I'm thinking "group my data by ..., then give me
temperature". That way it's a little clearer why there are two sets of [],
IMO.

> This raises the problem of non-homogeneous measurements. For example, if you
> had only a few measurements for Austin, the rest would be just NaNs to make
> the shape homogeneous.

Of course. And I will very often have this case. For instance, I will have
household survey data where each household has a certain id, but there are a
different number of family members.

>
> I solved this in sciexp2 with (this is not the API, but translated into a
> DatArray-like interface for clarity):
>
>   a = Data(numpy.array([...], dtype = [("precipitation", float),
>                                        ("temperature", float)]),
>            (("measurement", "@c...@-@mo...@-@y...@-@region@",
>              [{"city": "Austin", "month": "January", "year": 1980,
>                "region": "South"},
>               ...])))

I have no idea what that's supposed to do! What do you fill in the "missing"
data with, NaNs?

>
>   a.named[::"city == 'Memphis' && month == 'December'"]["temperature"].mean()
>   a.named[::"city == 'Memphis' && year == 1985"]["temperature"].mean()

This makes sense.

>
> But of course, this represents a tradeoff between "wasted" space and speed.
> The internals are along the lines of (using ordered dicts):
>
>   { 'city'  : { 'Memphis': set(<indexes with memphis>),
>                 ... },
>     'month' : { 'December': set(<indexes with december>),
>                 ... },
>     ... }
>
> Which translates into:
>
>   a[intersection( d['city']['Memphis'], d['month']['December'] )]
>
> There's a less optimized path that supports arbitrary expressions (less than,
> more than or equal, etc.), but has a cost of O(n).
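For what it's worth, here is how I read the dict-of-sets internals, as a rough sketch in plain Python and NumPy. This is my own guess, not sciexp2 code: build_index and rows_where are hypothetical helper names, and the data is four rows from the table quoted earlier. Since the query joins its conditions with &&, the per-label sets are intersected.

```python
import numpy as np

# Flat ("long") layout: one entry per measurement, no NaN padding needed.
cities = np.array(["Austin", "Boston", "Memphis", "Memphis"])
months = np.array(["January", "December", "January", "December"])
years = np.array([1980, 2009, 1980, 2009])
temperature = np.array([65.4, 23.5, 35.6, 33.5])

def build_index(labels):
    """Map each distinct label to the set of row positions where it occurs."""
    index = {}
    for pos, label in enumerate(labels.tolist()):
        index.setdefault(label, set()).add(pos)
    return index

# Hypothetical equivalent of the nested dict described in the quoted text
d = {
    "city": build_index(cities),
    "month": build_index(months),
    "year": build_index(years),
}

def rows_where(**conditions):
    """Rows matching ALL equality conditions: intersect the per-label sets."""
    sets = [d[var].get(value, set()) for var, value in conditions.items()]
    return sorted(set.intersection(*sets))

# Fast path: equality conditions answered from the precomputed sets
memphis_dec = rows_where(city="Memphis", month="December")
fast_mean = temperature[memphis_dec].mean()

# Fallback for arbitrary expressions: a boolean mask evaluated on every row
mask = (cities == "Memphis") & (years >= 2000)
slow_mean = temperature[mask].mean()
```

The index costs extra memory for the sets but avoids scanning every row for equality tests. The boolean mask at the end is the less optimized O(n) path for arbitrary expressions.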
Wouldn't this less optimized path need to be supported in any case?

>
>
>> An even more difficult question is what if I want descriptive
>> statistics on the "region" variable? I.e., I want to know how many
>> observations I have for each region. This one can wait, but is still
>> important for doing statistics.
>
> This _should_ be:
>
>   a.region.named("South").size

Sounds OK.

Skipper

>
>
> Read you,
>   Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion