Re: [Numpy-discussion] datarray repositories have diverged
Oh, I'm apparently confusing people's github usernames. Sorry about that. Josh's branch (jesusabdullah/datarray) is indeed the one I branched from, not Lluis's (xscript/datarray), though I merged in changes from Lluis at one point.

Does anyone know if it's possible to change the "forked from" location of my branch to be Fernando's branch?

-- Rob

On Fri, Oct 1, 2010 at 3:22 PM, Joshua Holbrook wrote:
> One thing I'd like to throw out there is that I haven't really done
> anything with my branch past maybe adding a gh-pages branch, and
> probably won't be for a while, if at all. As it turns out, I have a
> hard time concentrating on the intricacies of apis. >_<
>
> --Josh (jesusabdullah :E )
>
>
> On Fri, Oct 1, 2010 at 11:10 AM, Fernando Perez wrote:
>> On Thu, Sep 30, 2010 at 9:41 AM, Rob Speer wrote:
>>>
>>> The way you'd usually get something merged in this kind of project is
>>> to send a pull request to the leader using the "Pull Request" button.
>>> But in this case, I'm basically making my pull request on the mailing
>>> list, because it's not straightforward enough for a simple pull
>>> request.
>>
>> I just wanted to reply temporarily to say that I'm *not* ignoring this
>> discussion, despite appearances to the contrary :) In the next week
>> we hope to put some time into this at work, and I'll try to catch up
>> with the discussion tomorrow.
>>
>> One thing to note is that the new pull request system on GH is leaps
>> and bounds better than the old. Now they get automatically an issue,
>> a discussion page, a stable url, etc. So if anyone has anything on
>> datarray that they feel is ready to pull, it would be great if you
>> could click again on the pull request button (GH did not auto-migrate
>> old pull requests to the new system, they need to be made again
>> manually).
>>
>> And we'll do our best to hold our end of the bargain of collaborative
>> development over the next few days :)
>>
>> Regards,
>>
>> f

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] datarray repositories have diverged
The fact that I wasn't around for the sprint probably has a lot to do with how much the code had diverged. But it's not too bad -- I merged Fernando's branch into mine and only had to change a couple of things to make the tests pass.

There seem to be two general patterns for decentralized projects on GitHub: either you have one de facto leader who owns what everyone considers the main branch (this is what datarray is doing now, with Fernando as the leader), or you create a GitHub "organization" that owns the main branch and make a bunch of key people members of the organization (which is what numpy is doing).

The way you'd usually get something merged in this kind of project is to send a pull request to the leader using the "Pull Request" button. But in this case, I'm basically making my pull request on the mailing list, because it's not straightforward enough for a simple pull request.

-- Rob

On Thu, Sep 30, 2010 at 12:22 PM, Lluís wrote:
> Rob Speer writes:
>
>> However, I notice that all the new development on datarray is
>> happening on Fernando Perez's branch, which mine diverged from long
>> ago. I forked from Lluis (jesusabdullah)'s branch, which was the most
>> active at the time, and I got all but the most recent changes merged
>> back in. But that branch in turn was never merged back into fperez's.
>
> Ups! I thought my master branch was obsolete after the first sprint, so
> I deleted it and re-branched from fperez's. Thus, I suppose that
> comparing against my current master won't be useful to you.
>
> BTW, my fix branches are incomplete (no tests and doc have been
> updated), but in the future, how should they be merged (if they should
> be)? I mean, should datarray fork from the new github numpy into a new
> repository owned by a "datarray" user? I don't know much about how these
> kind of things are managed on github, but I remember some comments about
> that.
>
> apa!
>
> --
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth
[Numpy-discussion] datarray repositories have diverged
There's some DataArray code that I've had for a while, but I just finished it up and tested it today. Most notably, it includes a new __str__ for DataArrays that does some nice layout and includes ticks:

In [9]: print d_arr
country   year
--------- -------------------------------------------------
          1994      1998      2002      2006      2010
Netherlan -0.505758 0.096597  1.083148  -0.450156 0.172754
Uruguay   1.772182  -0.113394 -0.781307 1.002416  -0.64925
Germany   -2.013874 0.283947  1.170848  -0.504823 0.448497
Spain     -0.725844 0.909713  -1.191371 -0.465167 -1.518764

(The layout functions in datarray.print_grid actually work with any ndarray, so you can use it as an alternative to the __str__ in NumPy.)

However, I notice that all the new development on datarray is happening on Fernando Perez's branch, which mine diverged from long ago. I forked from Lluis (jesusabdullah)'s branch, which was the most active at the time, and I got all but the most recent changes merged back in. But that branch in turn was never merged back into fperez's.

The divergence point is even before I added the syntax for named axes and ticks by name (such as arr.named['Spain', 1994:2010] or arr.year.named[2010]).

Does it make sense to merge my branch into the main one?
http://github.com/rspeer/datarray

-- Rob
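A minimal sketch of the kind of layout shown in the message above, for readers who want to see how a grid with tick names on both axes can be produced. This is not datarray.print_grid; the function name `format_grid`, its parameters, and the fixed column width are assumptions made for illustration.

```python
def format_grid(row_ticks, col_ticks, rows, width=10):
    """Lay out a 2-D table with tick names on both axes (illustrative only)."""
    def cell(value):
        # Truncate so long names (e.g. 'Netherlands') cannot break alignment,
        # then pad to a fixed column width.
        return str(value)[: width - 1].ljust(width)

    header = cell("") + "".join(cell(t) for t in col_ticks)
    body = [cell(name) + "".join(cell(v) for v in row)
            for name, row in zip(row_ticks, rows)]
    # Trailing padding is dropped so lines end at the last value.
    return "\n".join(line.rstrip() for line in [header] + body)


text = format_grid(
    ["Netherlands", "Spain"],
    [1994, 1998],
    [[-0.505758, 0.096597], [-0.725844, 0.909713]],
)
print(text)
```

Truncating each cell to one character less than the column width is what turns "Netherlands" into "Netherlan" while keeping a space between columns.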
Re: [Numpy-discussion] basic question about numpy arrays
To explain: A has shape (2,1), meaning it's a 2-D array with 2 rows and 1 column. The transpose of A has shape (1,2): it's a 2-D array with 1 row and 2 columns. That's not the same as what you want, which is an array with shape (2,): a 1-D array with 2 entries.

When you take A[:,0], you're pulling out the 1-D array that constitutes column 0 of your array, which is exactly what you want.

-- Rob
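The shapes described above can be checked directly. This is a small sketch; the values in A are made up, since the original question's array is not shown in this message.

```python
import numpy as np

# A column vector: 2 rows, 1 column (values are placeholders).
A = np.array([[1.0], [2.0]])

print(A.shape)        # (2, 1)
print(A.T.shape)      # (1, 2): still 2-D, just 1 row and 2 columns
print(A[:, 0].shape)  # (2,): a genuine 1-D array

col = A[:, 0]         # column 0 as a 1-D array
flat = A.ravel()      # another way to get a 1-D array from A
```

Transposing only swaps the two axes; an integer index such as `A[:, 0]` is what actually drops the second axis.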
Re: [Numpy-discussion] Datarray BoF, part2
I agree with the idea that axis labels must be strings.

Yes, this is the opposite of my position on tick labels ("names"), but there's a reason: ticks are often defined by whatever data you happen to be working with, but axis labels will in the vast majority of situations be defined by the programmer as they're writing the code. If the programmer wants to name something, they'll certainly be able to do so with a string.

-- Rob

On Wed, Jul 21, 2010 at 2:08 PM, Keith Goodman wrote:
> On Wed, Jul 21, 2010 at 10:58 AM, M Trumpis wrote:
>
>> Separately, regarding the permissible axis labels, I think we must not
>> allow any enumerated axis labels (ie, ints and floats). I don't
>> remember if there was a consensus about that yesterday. We don't have
>> the flexibility in the ndarray API to allow for the expression
>> darr.method(axis=2) to mean not the 2nd dimension, but the Axis with
>> label==2
>
> So the axis label rule could be either:
>
> 1. str only
> 2. Any hashable object except int or float
>
> #1 is looking better and better. Plus you already coded it :)
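A rough sketch of the "str only" rule and of why it keeps axis=2 unambiguous. The helper names `check_axis_label` and `resolve_axis` are hypothetical and are not taken from datarray; allowing None for an unlabeled axis is also an assumption.

```python
def check_axis_label(label):
    """Accept None (unlabeled) or a str; reject everything else."""
    if label is None or isinstance(label, str):
        return label
    raise TypeError("axis label must be a str or None, got %r" % (label,))


def resolve_axis(labels, axis):
    """Map an axis argument to a position.

    Because labels are never ints, an int can only mean a position
    and a str can only mean a label.
    """
    if isinstance(axis, int):
        return range(len(labels))[axis]   # also normalizes negative positions
    return labels.index(axis)


labels = ("country", "year")
print(resolve_axis(labels, 1))        # position given directly
print(resolve_axis(labels, "year"))   # position found from the label
```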
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
> Not really. 1-D structured arrays can and do work well for the very
> common case where one has unlabeled rows and labeled columns. They are
> also a little bit more flexible in that the columns can be
> heterogeneous in dtype, as columns are wont to do.
>
> May I politely suggest that, just as some people did not do a
> sufficient job of reading the datarray proposal to understand how they
> differ from structured arrays, you do not know as much about
> structured arrays to understand the ways in which they are similar to
> labeled arrays? Understanding both the similarities and differences is
> important because both are going to be living in the same ecosystem
> with overlapping niches.

All right, you got me there. I had no idea that record arrays had that functionality.

In the end, we are making the same point. Datarrays and record arrays take different approaches to similar problems, and can easily coexist, and you could even make a datarray of records if you wanted.

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
rec['305'] extracts a single value from a single record.

arr.named[:,305] extracts an *entire column* from a 2-D datarray, returning you a 1-D datarray.

Once again, 1-D record arrays and 2-D labeled arrays look similar when you print them, but the data structures are so unrelated that there is really not much point in comparing them any further.

-- Rob

On Mon, Jul 12, 2010 at 6:01 PM, Neil Crighton wrote:
> Rob Speer writes:
>
>> It's not just about the rows: a 2-D datarray can also index by
>> columns, an operation that has no equivalent in a 1-D array of records
>> like your example.
>
> rec['305'] effectively indexes by column. This is one the main attractions of
> structured/record arrays.
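To make the two readings of rec['305'] in this exchange concrete, here is a small NumPy sketch. The ratings, the field names (`u305`, `u6`), and the dictionary-based tick lookup are all invented for illustration; the last part only imitates what arr.named[:,305] is described as doing and does not use datarray.

```python
import numpy as np

# 1-D structured array: one record per movie, one field per user.
rec = np.array(
    [("Wrong Trousers, The (1993)", 5.0, 4.0),
     ("Toy Story (1995)", 3.0, 4.5)],
    dtype=[("name", "U40"), ("u305", "f8"), ("u6", "f8")],
)

one_value = rec[0]["u305"]    # field of a single record: one scalar
whole_field = rec["u305"]     # field across every record: a 1-D array

# 2-D homogeneous array, with tick names kept in plain dictionaries.
movies = ["Wrong Trousers, The (1993)", "Toy Story (1995)"]
users = [305, 6]
ratings = np.array([[5.0, 4.0],
                    [3.0, 4.5]])
row_of = {title: i for i, title in enumerate(movies)}
col_of = {user: j for j, user in enumerate(users)}

by_user = ratings[:, col_of[305]]                   # column chosen by tick name
by_movie = ratings[row_of["Toy Story (1995)"], :]   # row chosen by tick name
```

Which reading of rec['305'] applies depends on whether `rec` is a single record or the whole array. In the 2-D layout both axes are selected the same way, and integer tick names such as 305 need no string conversion.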
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
It's not just about the rows: a 2-D datarray can also index by columns, an operation that has no equivalent in a 1-D array of records like your example.

In the movie example, arr.col_named(305) (or, in datarray syntax, arr.named[:,305], or arr.user.named[305]) contains the movie ratings for the user with ID 305, still indexed by movie titles. You can't do that at all with a record array of the form you described, except by using a list comprehension over the whole array that turns it into something else.

2-D datarrays and 1-D record arrays may look similar, but they are very different data structures. In fact, they're probably orthogonal to each other -- I see no reason one couldn't make a datarray of records, except for the fact that I wouldn't want to write the __str__ for such a beast.

(Speaking of which, I'm working on a 2-D datarray __str__ based on the Divisi one. I have to make it support datatypes besides floats, however.)

-- Rob

On Sun, Jul 11, 2010 at 2:09 PM, Neil Crighton wrote:
> Robert Kern writes:
>
>>
>> On Sun, Jul 11, 2010 at 11:36, Rob Speer wrote:
>> >> But the utility of named indices is not so clear
>> >> to me. As I understand it, these new arrays will still only be
>> >> able to have a single type of data (one of float, str, int and so
>> >> on). This seems to be pretty limiting.
>>
>> Having ticks on *every* axis is the primary feature there.
>>
>
> I see, thanks.
>
> So for Rob's example slide you could use a record array:
>
> rec = np.rec.fromrecords(data, names='name,305,6,234')
>
> (Here data is a list of tuples, each tuple giving the movie name + it's data.)
>
> In this case it's easy to index by field name (rec['205']), but a trickier to
> choose the row using the movie name:
>
> ind = dict((n,i) for i,n in enumerate(rec.name))
>
> rec[ind['Wrong Trousers, The (1993)']]
>
> So datarrays would make this easier.
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
> But the utility of named indices is not so clear
> to me. As I understand it, these new arrays will still only be
> able to have a single type of data (one of float, str, int and so
> on). This seems to be pretty limiting.

This just shows that people use NumPy for lots of different things. I myself have never understood what record arrays are for -- what's the advantage of using NumPy if your data isn't made of numbers?

If you've ever used a dictionary instead of a list, then you have seen the utility of named indices. I've got an example of labeled data in a matrix, starting at slide 10 of http://csc.media.mit.edu/docs/_static/divisi_slides.pdf.

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
Keith Goodman wrote:
> I ran into a few more questions while playing with datarrays, so I started a
> list:
> http://github.com/kwgoodman/datarrayQ

I have quick answers to some of the questions.

> Can I have ticks without labels?

Ideally, yes, but good catch: the current code disallows that for no good reason.

> Add a ticks input parameter?

I very much approve of this proposal (to add ticks= to define ticks separately from axes).

> Create Axis._tick_dict when needed?

Wait, the dictionary wouldn't be saved at all? What's the point, then? Constant-time lookups of tick names are essential, and this proposal would turn that into linear time.

> Can we prevent user from messing up a datarray?

No. That's pretty much built into Python: the downstream user can do anything they want to. Our job is to make sure that what the user wants to do is use the datarray correctly. :)

> 0d datarrays?

As 0d datarrays are completely pointless, I'm pretty sure that any code that creates a 0d datarray is a mistake and should fail early.

> Can axis labels be anything besides None or str?

Possibly. The part of this question I particularly like is accessing attributes programmatically, using arr.axis[axisname]. That gives .axis much more of a purpose. (Follow-up question: should we merge .axis and .axes in the API?)

> Direct access to array?

It's trivial: DataArray is a subclass of ndarray, so a DataArray already is an ndarray. If you want to strip off all the datarray stuff anyway (perhaps for efficiency reasons), you can use np.asarray(arr).

> Support for alignment?

Very yes. Aligning/joining labels is something that basically everyone who works with labeled data needs to do, so we should figure out the logic for it and include it in datarray so downstream users don't have to reinvent it.

> Can labels and ticks be changed?

I'd favor them being immutable, but could have my mind changed by a good use case for mutating them.

-- Rob
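The concern about Axis._tick_dict in the message above is the difference between building a dictionary once and rebuilding or scanning on every lookup. A minimal sketch follows; the `Axis` class here is a hypothetical stand-in rather than datarray's class, and storing ticks as a tuple reflects the stated preference for immutable ticks.

```python
class Axis:
    """Hypothetical stand-in for a datarray axis; not the real class."""

    def __init__(self, label, ticks):
        self.label = label
        self.ticks = tuple(ticks)   # immutable, so the cache cannot go stale
        self._tick_dict = None      # built on first use, then kept

    def index_of(self, tick):
        if self._tick_dict is None:
            # One linear pass, paid once per axis.
            self._tick_dict = {t: i for i, t in enumerate(self.ticks)}
        return self._tick_dict[tick]   # constant time on average

    def index_of_uncached(self, tick):
        # What lookup degrades to if nothing is saved: a linear scan per call.
        return self.ticks.index(tick)


year = Axis("year", [1994, 1998, 2002, 2006, 2010])
position = year.index_of(2006)
```

Creating the dictionary lazily is compatible with constant-time lookup as long as the dictionary is stored after the first call.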
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
Now, the one part I've implemented that I just made up instead of looking to the SciPy consensus (because there was no SciPy consensus) was how to refer to multiple labeled axes without repeating ".axis" all over the place.

My choice, which I call "magical axis attributes", is to have arr.somelabel == arr.axis.somelabel whenever it doesn't mean something else. This turns the call

    arr.axis.country.named['Netherlands'].axis.year[-1]

into:

    arr.country.named['Netherlands'].year[-1]

I got a message from Fernando Perez saying that he didn't like the magical axis attributes, for the expected reason that it's inconsistent. You shouldn't have to refer to your axis differently just because you called it something like "mean". Another problem that just occurred to me is that datarray-using code could break just because DataArray, or even ndarray itself, grew a new method.

I like the syntax that magical attributes provide, but I'm willing to consider other options. Here's one:

The __getattr__ only does its magic on attribute names that end in "_index" or "_named", which should not conflict with other method names. "arr.foo_index[3]" is the same as "arr.axis.foo[3]". Furthermore, "arr.foo_named['bar']" is the same as "arr.axis.foo.named['bar']". Then the above lookup becomes:

    arr.country_named['Netherlands'].year_index[-1]

I don't find this as appealing as magical attributes, but perhaps it's more responsible.

I'd like to know what other people think, so let me summarize and name the existing proposals:

    arr.axis.country.named['Netherlands'].axis.year[-1]   # the default option -- works in any case
    arr[ arr.aix.country.named['Netherlands'].year[-1] ]  # the "stuple" option
    arr.country.named['Netherlands'].year[-1]             # the "magical" option
    arr.country_named['Netherlands'].year_index[-1]       # the "semi-magical" option

-- Rob

On Fri, Jul 9, 2010 at 1:39 AM, Rob Speer wrote:
> http://github.com/rspeer/datarray represents my best guess at the
> SciPy BOF consensus. I recently switched the method of accessing named
> ticks from .named() to .named[] based on further discussion here.
>
> My implementation is still missing the case with named ticks but
> positional axes, however. That is, you should be able to use .named
> directly on the top-level datarray without referring to any axis
> labels, to say something like arr.named['Netherlands', 2010], but you
> can't yet.
> -- Rob
>
> On Thu, Jul 8, 2010 at 11:44 PM, Keith Goodman wrote:
>> On Thu, Jul 8, 2010 at 1:20 PM, Fernando Perez wrote:
>>
>>> The consensus at the BoF (not that it means it's set in stone, simply
>>> that there was good chance for back-and-forth on the topic with many
>>> voices) was that:
>>>
>>> 1. There are valid use cases for 'integer ticks', i.e. integers that
>>> index arbitrarily into an array instead of in 0..N-1 fashion.
>>>
>>> 2. That having plain arr[0] give anything but the first element in arr
>>> would be way too confusing in practice, and likely to cause too many
>>> problems.
>>>
>>> 3. That the best solution to allow integer ticks while retaining
>>> 'normal' indexing semantics for integers would be to have
>>>
>>> arr[int] -> normal indexing
>>> arr.somethin[int] -> tick-based indexing, where an int can mean anything.
>>
>> Has the Scipy 2010 BOF consensus been implemented in anyone's fork? I
>> don't understand the indexing so I'd like to try it.
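The "semi-magical" option from the message above can be sketched with a __getattr__ that reacts only to the two suffixes. This toy holds axis labels and tick names but no data, so each lookup returns a position along the axis rather than a sub-array, and lookups do not chain as they would in datarray. The class names `Labeled` and `_Lookup` are hypothetical.

```python
class _Lookup:
    """Square-bracket accessor bound to one axis and one lookup mode."""

    def __init__(self, ticks, by_name):
        self.ticks = ticks
        self.by_name = by_name

    def __getitem__(self, key):
        if self.by_name:
            return self.ticks.index(key)       # tick name -> position
        return range(len(self.ticks))[key]     # index -> position (handles -1)


class Labeled:
    """Toy container: only axis labels and tick names, no data."""

    def __init__(self, **axes):
        self._axes = {label: list(ticks) for label, ticks in axes.items()}

    def __getattr__(self, attr):
        # Reached only when normal attribute lookup has already failed.
        if attr.startswith("_"):
            raise AttributeError(attr)
        for suffix, by_name in (("_named", True), ("_index", False)):
            if attr.endswith(suffix):
                label = attr[: -len(suffix)]
                if label in self._axes:
                    return _Lookup(self._axes[label], by_name)
        raise AttributeError(attr)


arr = Labeled(country=["Netherlands", "Uruguay"],
              year=[1994, 1998, 2002, 2006, 2010])
print(arr.country_named["Netherlands"])   # position of the named tick
print(arr.year_index[-1])                 # last position on the year axis
```

Because only names ending in "_index" or "_named" are intercepted, an axis labeled "mean" cannot shadow, or be shadowed by, a method of the same name.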
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
http://github.com/rspeer/datarray represents my best guess at the SciPy BOF consensus. I recently switched the method of accessing named ticks from .named() to .named[] based on further discussion here.

My implementation is still missing the case with named ticks but positional axes, however. That is, you should be able to use .named directly on the top-level datarray without referring to any axis labels, to say something like arr.named['Netherlands', 2010], but you can't yet.

-- Rob

On Thu, Jul 8, 2010 at 11:44 PM, Keith Goodman wrote:
> On Thu, Jul 8, 2010 at 1:20 PM, Fernando Perez wrote:
>
>> The consensus at the BoF (not that it means it's set in stone, simply
>> that there was good chance for back-and-forth on the topic with many
>> voices) was that:
>>
>> 1. There are valid use cases for 'integer ticks', i.e. integers that
>> index arbitrarily into an array instead of in 0..N-1 fashion.
>>
>> 2. That having plain arr[0] give anything but the first element in arr
>> would be way too confusing in practice, and likely to cause too many
>> problems.
>>
>> 3. That the best solution to allow integer ticks while retaining
>> 'normal' indexing semantics for integers would be to have
>>
>> arr[int] -> normal indexing
>> arr.somethin[int] -> tick-based indexing, where an int can mean anything.
>
> Has the Scipy 2010 BOF consensus been implemented in anyone's fork? I
> don't understand the indexing so I'd like to try it.
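The BoF consensus quoted above, plain integers for positions and a separate accessor for ticks, can be illustrated with a toy 1-D container. `Series` and `_Named` are hypothetical names and the values are invented; this is not how datarray is implemented.

```python
class _Named:
    """Tick-based accessor: any hashable, including an int, is a tick name."""

    def __init__(self, ticks, values):
        self._position = {tick: i for i, tick in enumerate(ticks)}
        self._values = values

    def __getitem__(self, tick):
        return self._values[self._position[tick]]


class Series:
    """Toy 1-D labeled array."""

    def __init__(self, ticks, values):
        self.values = list(values)
        self.named = _Named(ticks, self.values)

    def __getitem__(self, index):
        return self.values[index]   # normal positional indexing


by_year = Series([1994, 1998, 2002, 2006, 2010],
                 [10.0, 11.0, 12.0, 13.0, 14.0])

first = by_year[0]              # the first element, not "year 0"
in_2006 = by_year.named[2006]   # the element whose tick is 2006
```

With this split, `by_year.named[0]` raises KeyError because no tick is named 0, while `by_year[0]` keeps its usual meaning.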
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
> I think we have to start from the nD case, even if I (and I think most
> users) will tend to think in 2D. The rest is just going to have to be
> up to developers how they want users to interact with what we, the
> developers, see as axes. No end-user wants to think about the 6th
> axis of the data, but I don't want to be pegged into rows and columns
> thinking because I don't think it works for the below example.

In a lot of tasks, you can get by with just two axes of data, and it is intuitive to refer to them as "rows" and "columns". If you have more than two axes, then ideally you give them particular labels, or otherwise you can still call them axes[0], axes[1], axes[2], etc.

But the question is, if we have convenient names like this for matrices, should .row on a 3-d array raise an error, or should it give you axes[0] even though that's more like a plane than a row?

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
> 3. That the best solution to allow integer ticks while retaining
> 'normal' indexing semantics for integers would be to have
>
> arr[int] -> normal indexing
> arr.somethin[int] -> tick-based indexing, where an int can mean anything.

All right, it's clear lots of people like it better this way, so I made arr.named use square brackets instead of parentheses.

Before:
>>> arr.country.named('Spain').year.named(slice(1994, 2010))

After:
>>> arr.country.named['Spain'].year.named[1994:2010]

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
>> Still, I have a question. Did you also agree that this should forcibly index
>> through ticks?
>>
>> arr.something[int] -> tick-based indexing
>>
>
> Yes.

I feel like people are talking about different things because it's unclear what the .something is.

If the .something is an axis name, then no. arr.year[0] should get the first year in the data, not the data from the "year 0".

If the .something is the attribute we use for named lookup (such as ".named"), then yes. arr.named[2006] should get whatever tick is named 2006 on the first axis.

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
> No. I'd rather go for eliminating the 'arr.year.named', and providing only:
> * arr.__getitem__
> * arr.named.__getitem__
> * arr..__getitem__
>
> The first being just the current ndarray.__getitem__, and the two last methods
> would accept both strings and integers, assuming that names/ticks based on
> integers (e.g., the 1994 above) must be provided as strings, or otherwise are
> treated as good old array indexes.

There are lots of data types besides strings that make good names (tuples, for example). My impression from SciPy was that people would prefer separate accessors for names and indices, especially because integers (a really common data type, after all) shouldn't be forbidden.

Also, working with strings of integers like '2010' makes me feel like I'm using PHP, a feeling I like to avoid whenever possible. :)

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
On Thu, Jul 8, 2010 at 2:27 PM, Skipper Seabold wrote:
> On Thu, Jul 8, 2010 at 1:35 PM, Rob Speer wrote:
>> Your labels are unique if you look at them the right way. Here's how I
>> would represent that in a datarray:
>> * axis0 = 'city', ['Austin', 'Boston', ...]
>> * axis1 = 'month', ['January', 'February', ...]
>> * axis2 = 'year', [1980, 1981, ...]
>> * axis3 = 'region', ['Northeast', 'South', ...]
>> * axis4 = 'measurement', ['precipitation', 'temperature']
>>
>> and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
>> axis3, axis4].
>>
>
> Yeah, this is what I was thinking I would have to do, but it's still
> not clear to me (I have trouble trying to think in 5 dimensions...).
> For instance, what axis holds my actual numeric data?
>
> axis4, with a "precipitation" tick?

Yep, that's what I was suggesting. Or you could have two different 4-D matrices, one whose values are precipitation and one whose values are temperatures.

>> Now I realize not everyone wants to represent their tabular data as a
>> big tensor that they index every which way, and I think this is one
>> thing that pandas is for.
>
> This is kind of where I would like the divide to be between user and
> developer. On top of all of this, I would like to see a __repr__ or
> something that actually spits out a 2d spreadsheet-looking
> representation. It would help me stay sane I think. Fernando's nice
> 3D graphic only can go so far as a mental model (for me at least).

Divisi2 uses a 2D labeled representation as its __str__ -- an example is at http://csc.media.mit.edu/docs/divisi2/sparse.html

I could port this onto datarray. I was holding off because I was unsure about how to represent the N-d case, but I realize now that showing the entries in this kind of 2-D tabular format could actually be a really intuitive way to do it.

> Mix-ins sounds reasonable to me as long as this could easily be
> accomplished. Ie., why use csr? Can you go between others? Are the
> sparse matrices reasonably stable given recent activity? Not
> rhetorical questions, I don't use sparse matrices much.

These are good questions. I ended up using PySparse instead of scipy.sparse, because SciPy 0.7's sparse matrices weren't ready to support many important operations, particularly slicing. SciPy 0.8's sparse matrices look much better, and I may transition to using them once it's released.

When planning future features of NumPy, of course, we should assume SciPy's sparse matrices do what we want (and possibly fix them if they don't).

csr_matrix was just an example. I think there would have to be separate classes for labeled csr_matrices, labeled lil_matrices, and so on, supporting all the usual methods for converting between them.

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
>> But I don't understand your second example:
>>> arr.country['Spain'].year[1994:2010]
>
>> That seems to run straight into the index/name ambiguity. Shouldn't
>> that take the 1994th through 2010th indices along the "year" axis? Not
>> every axis will have names, so you can't make *all* the indexing go by
>> names.
>
> Sorry, I just c&p without placing the necessary '.
>
>> If named were a getitem-able object, that would be:
> arr.country.named['Spain'].year.named[1994:2010]
>
> Or what I was striving for:
>
> arr.year.named[1994:2010]
> arr.year['1994':'2010']
> arr.year['1994':-3]

So your proposal is, whenever there's an index that is not an integer, look it up by name, and use .named only if you want integer tick names?

This feels too inconsistent to me. It adds a fair amount of confusion to save a small amount of typing. If keystrokes are that important, I'd rather replace "named" with something shorter than lose the distinction entirely.

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
> Forgive me if this is has already been addressed, but my question is
> what happens when we have more than one "label" (not as in a labeled
> axis but an observation label -- but not a tick because they're not
> unique!) per say row axis and heterogenous dtypes. This is really the
> problem that I would like to see addressed and from the BoF comments
> I'm not sure this use case is going to be covered. I'm also not sure
> I expressed myself clearly enough or understood what's already
> available. For me, this is the single most common use case and most
> of what we are talking about now is just convenient slicing but
> ignoring some basic and prominent concerns. Please correct me if I'm
> wrong. I need to play more with DataArray implementation but haven't
> had time yet.
>
> I often have data that looks like this (not really, but it gives the
> idea in a general way I think).
>
> city, month, year, region, precipitation, temperature
> "Austin", "January", 1980, "South", 12.1, 65.4,
> "Austin", "February", 1980, "South", 24.3, 55.4
> "Austin", "March", 1980, "South", 3, 69.1
>
> "Austin", "December", 2009, 1, 62.1
> "Boston", "January", 1980, "Northeast", 1.5, 19.2
>
> "Boston","December", 2009, "Northeast", 2.1, 23.5
> ...
> "Memphis","January",1980, "South", 2.1, 35.6
> ...
> "Memphis","December",2009, "South", 1.2, 33.5
> ...

Your labels are unique if you look at them the right way. Here's how I would represent that in a datarray:
* axis0 = 'city', ['Austin', 'Boston', ...]
* axis1 = 'month', ['January', 'February', ...]
* axis2 = 'year', [1980, 1981, ...]
* axis3 = 'region', ['Northeast', 'South', ...]
* axis4 = 'measurement', ['precipitation', 'temperature']

and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2, axis3, axis4].

Now I realize not everyone wants to represent their tabular data as a big tensor that they index every which way, and I think this is one thing that pandas is for.

Oh, and the other problem with the 5-D datarray is that you'd probably want it to be sparse. This is another discussion worth having.

I want to eventually replace the labeling stuff in Divisi with datarray, but sparse matrices are largely the point of using Divisi. So how do we make a sparse datarray?

One answer would be to have datarray be a wrapper that encapsulates any sufficiently matrix-like type. This is approximately what I did in the now-obsolete Divisi1. Nobody liked the fact that you had to wrap and unwrap your arrays to accomplish anything that we hadn't thought of in writing Divisi. I would not recommend this route.

The other option, which is more like Divisi2, would be to provide the functionality of datarray using a mixin. Then a standard dense datarray could inherit from (np.ndarray, Datarray), while a sparse datarray could inherit from (sparse.csr_matrix, Datarray), for example.

-- Rob
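The mixin route described above can be sketched without NumPy or SciPy by using two toy storage classes in place of np.ndarray and sparse.csr_matrix. Everything here (`LabelMixin`, `DenseStore`, `SparseStore`, `set_labels`, `entry_named`) is hypothetical; the point is only that the labeling code is written once and never wraps the storage object.

```python
class LabelMixin:
    """Adds name-based lookup to any class that provides get(i, j)."""

    def set_labels(self, row_names, col_names):
        self._row_of = {name: i for i, name in enumerate(row_names)}
        self._col_of = {name: j for j, name in enumerate(col_names)}
        return self

    def entry_named(self, row_name, col_name):
        return self.get(self._row_of[row_name], self._col_of[col_name])


class DenseStore:
    """Toy dense storage: a list of rows."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def get(self, i, j):
        return self.rows[i][j]


class SparseStore:
    """Toy sparse storage: {(i, j): value}, with missing entries read as 0.0."""

    def __init__(self, entries):
        self.entries = dict(entries)

    def get(self, i, j):
        return self.entries.get((i, j), 0.0)


# Storage class first, mixin second, mirroring the order in the message.
class LabeledDense(DenseStore, LabelMixin):
    pass


class LabeledSparse(SparseStore, LabelMixin):
    pass


cities = ["Austin", "Boston"]
measurements = ["precipitation", "temperature"]
dense = LabeledDense([[12.1, 65.4], [1.5, 19.2]]).set_labels(cities, measurements)
sparse = LabeledSparse({(0, 0): 12.1, (1, 1): 19.2}).set_labels(cities, measurements)
```

Since `LabeledDense` is itself a `DenseStore`, code that expects the plain storage type keeps working with no unwrapping step, which is the drawback attributed to the wrapper approach.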
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
> While I haven't had a chance to really look in-depth at the changes
> myself (I'm a busy man! So many mailing lists!), I so far like the
> look and sound of them. That's just my opinion, though.

If people are okay with the attribute magic, I have a proposal for more of it.

In my own project where I use labeled arrays (http://github.com/commonsense/divisi2), I don't have labeled axes. But I assumed everything was 1 or 2-D, and gave the 2-D matrices methods like "row_named", "col_named", etc., to encourage readable code.

With the current implementation of datarray, I could get that by labeling the axes "row" and "col", except the moment you transpose a matrix like that you get rows named "col" and columns named "row", so that's not the right answer.

My proposal is that datarray.row should be equivalent to datarray.axes[0], and datarray.column should be equivalent to datarray.axes[1], so that you can always ask for something like "arr.column.named(2010)" (replace those with square brackets if you like).

Not sure yet what the right way is to generalize this to 1-D and n-D.

-- Rob
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
On Thu, Jul 8, 2010 at 7:13 AM, Lluís wrote:
> Thus, we can use something in the middle:
>
> arr[0,1]
> arr.names['Netherlands',2010] # I'd rather go for 'names' instead of 'ticks'

Ah ha. So this is the case with positional axes but named ticks, which we haven't really brought up yet. I'm definitely thinking of making the top-level datarray support "named" as well, which would make it into:

>>> arr.named('Netherlands', 2010)

But the other change you've got here is to make "named" into a __getitem__-able object instead of a method, so you use square brackets with it and can use slice syntax. I could do it this way as well.

But I don't understand your second example:

> arr.country['Spain'].year[1994:2010]

That seems to run straight into the index/name ambiguity. Shouldn't that take the 1994th through 2010th indices along the "year" axis? Not every axis will have names, so you can't make *all* the indexing go by names.

If named were a getitem-able object, that would be:

>>> arr.country.named['Spain'].year.named[1994:2010]

-- Rob
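To illustrate the index/name ambiguity, here is a small sketch of "named" as a getitem-able object on a single axis. Axis1D and NameIndexer are hypothetical names, not datarray classes. The thread does not say how a slice of names should treat its end point, so the half-open behavior below (start included, stop excluded, ticks assumed sortable) is purely an assumption.

```python
class NameIndexer:
    """Helper returned by `axis.named`; translates tick names to positions."""

    def __init__(self, axis):
        self.axis = axis

    def __getitem__(self, key):
        ticks = self.axis.ticks
        if isinstance(key, slice):
            # Assumption: half-open name range, like positional slices.
            lo, hi = key.start, key.stop
            picked = [
                i for i, tick in enumerate(ticks)
                if (lo is None or tick >= lo) and (hi is None or tick < hi)
            ]
            return [self.axis.values[i] for i in picked]
        return self.axis.values[ticks.index(key)]


class Axis1D:
    """Toy axis holding tick names and the values stored at each tick."""

    def __init__(self, ticks, values):
        self.ticks = list(ticks)
        self.values = list(values)

    def __getitem__(self, key):
        # Plain square brackets always mean positions, never names.
        return self.values[key]

    @property
    def named(self):
        return NameIndexer(self)


year = Axis1D([1994, 1998, 2002, 2006, 2010], ["a", "b", "c", "d", "e"])
by_position = year[1994:2010]     # positions 1994..2009: nothing there
by_name = year.named[1994:2010]   # ticks named 1994 up to (not incl.) 2010
```

With integer tick names, the two spellings give different answers on the same axis, which is why the thread keeps them as separate entry points.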
Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes
Glad I finally found this discussion. I implemented some of the ideas from the SciPy BOF discussion, and Joshua has already merged them into his datarray on GitHub (thanks, Joshua, for being so fast on the merge button).

To introduce these changes, here are a couple of examples of how you could index into a matrix whose rows represent countries, and whose columns represent something that is observed every four years (hmm...).

>>> arr.country.named('Netherlands').year.named(2010)
>>> arr.country.named('Spain').year.named(slice(1994, 2010))
>>> arr.year.named(2006).country[0:2]

First of all, a bit of terminology. Axes can have labels. Ticks (which are particular rows, columns, etc.) can have names. Axes and ticks also have indices (the sequential numbers they've always had). Feel free to suggest alternate terminology; I just used what sounded the most natural to me in the method names.

Addressing by indices and addressing by tick names are separate, which allows integers to be tick names without a conflict. You use the "named" method of an axis to address it by name, while __getitem__ only addresses it by indices. You can still take slices of names (makes sense for things like years), but you have to spell out "slice" because it's not inside square brackets.

Then, at the axis level:

My impression from the SciPy discussion was that people wanted to be able to look up multiple labeled axes at once without repeating themselves, and .aix and stuples were not satisfying, but we didn't come up with anything else during the discussion.

My choice was to add a bit of attribute magic: if you get an attribute of a datarray that is (a) not a real attribute and (b) matches the label of one of its axes, you'll get that axis. So "arr.axis.country" can be shortened to "arr.country", for example, but if you decided to name your axis "T", you would be stuck with "arr.axis.T".
So this is the state of the code at http://github.com/rspeer/datarray (and also at http://github.com/jesusabdullah/datarray now). I'll even try to make the documentation catch up with this code if people think the changes are good.

-- Rob
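The attribute magic described in this message can be sketched with Python's __getattr__ hook, which runs only after normal attribute lookup has failed. This is an illustration under that assumption, not the code in either repository; Labeled, AxisNamespace and AxisRef are hypothetical names, and the T property merely stands in for a real attribute such as ndarray.T.

```python
class AxisRef:
    """Minimal axis handle: just a label and a positional index."""

    def __init__(self, label, index):
        self.label = label
        self.index = index


class AxisNamespace:
    """Explicit lookup that always works: `arr.axis.<label>`."""

    def __init__(self, refs):
        self._refs = refs  # dict mapping label -> AxisRef

    def __getattr__(self, label):
        refs = self.__dict__.get("_refs", {})
        if label in refs:
            return refs[label]
        raise AttributeError(label)


class Labeled:
    """Toy array-like object with labeled axes and shortcut attributes."""

    def __init__(self, labels):
        refs = {label: AxisRef(label, i) for i, label in enumerate(labels)}
        self.axis = AxisNamespace(refs)

    @property
    def T(self):
        # Stands in for a real attribute that an axis label must not shadow.
        return "transpose"

    def __getattr__(self, name):
        # Reached only when `name` is not a real attribute, condition (a);
        # the namespace then checks that it matches a label, condition (b).
        axis = self.__dict__.get("axis")
        if axis is None:
            raise AttributeError(name)
        return getattr(axis, name)


arr = Labeled(["country", "T"])
shortcut = arr.country     # same object as arr.axis.country
real_attr = arr.T          # the real attribute wins over the axis label
explicit = arr.axis.T      # the axis labeled "T" is still reachable
```

Because real attributes are resolved first, an axis labeled "T" can never hide the real T; it is only reachable through the explicit arr.axis form, matching the behavior described above.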