Re: [Numpy-discussion] datarray repositories have diverged

2010-10-01 Thread Rob Speer
Oh, I'm apparently confusing people's github usernames. Sorry about that.

Josh's branch (jesusabdullah/datarray) is indeed the one I branched
from, not Lluis's (xscript/datarray), though I merged in changes from
Lluis at one point.

Does anyone know if it's possible to change the "forked from" location
of my branch to be Fernando's branch?
-- Rob

On Fri, Oct 1, 2010 at 3:22 PM, Joshua Holbrook wrote:
> One thing I'd like to throw out there is that I haven't really done
> anything with my branch past maybe adding a gh-pages branch, and
> probably won't be for a while, if at all. As it turns out, I have a
> hard time concentrating on the intricacies of apis. >_<
>
> --Josh (jesusabdullah :E )
>
>
> On Fri, Oct 1, 2010 at 11:10 AM, Fernando Perez wrote:
>> On Thu, Sep 30, 2010 at 9:41 AM, Rob Speer wrote:
>>>
>>> The way you'd usually get something merged in this kind of project is
>>> to send a pull request to the leader using the "Pull Request" button.
>>> But in this case, I'm basically making my pull request on the mailing
>>> list, because it's not straightforward enough for a simple pull
>>> request.
>>
>> I just wanted to reply temporarily to say that I'm *not* ignoring this
>> discussion, despite appearances to the contrary :)  In the next week
>> we hope to put some time into this at work, and I'll try to catch up
>> with the discussion tomorrow.
>>
>> One thing to note is that the new pull request system on GH is leaps
>> and bounds better than the old.  Now they automatically get an issue,
>> a discussion page, a stable url, etc.  So if anyone has anything on
>> datarray that they feel is ready to pull, it would be great if you
>> could click again on the pull request button (GH did not auto-migrate
>> old pull requests to the new system, they need to be made again
>> manually).
>>
>> And we'll do our best to hold our end of the bargain of collaborative
>> development over the next few days :)
>>
>> Regards,
>>
>> f
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>


Re: [Numpy-discussion] datarray repositories have diverged

2010-09-30 Thread Rob Speer
The fact that I wasn't around for the sprint probably has a lot to do
with how much the code had diverged. But it's not too bad -- I merged
Fernando's branch into mine and only had to change a couple of things
to make the tests pass.

There seem to be two general patterns for decentralized projects on
GitHub: either you have one de facto leader who owns what everyone
considers the main branch (this is what datarray is doing now, with
Fernando as the leader), or you create a GitHub "organization" that
owns the main branch and make a bunch of key people members of the
organization (which is what numpy is doing).

The way you'd usually get something merged in this kind of project is
to send a pull request to the leader using the "Pull Request" button.
But in this case, I'm basically making my pull request on the mailing
list, because it's not straightforward enough for a simple pull
request.

-- Rob

On Thu, Sep 30, 2010 at 12:22 PM, Lluís wrote:
> Rob Speer writes:
>
>> However, I notice that all the new development on datarray is
>> happening on Fernando Perez's branch, which mine diverged from long
>> ago. I forked from Lluis (jesusabdullah)'s branch, which was the most
>> active at the time, and I got all but the most recent changes merged
>> back in. But that branch in turn was never merged back into fperez's.
>
> Oops! I thought my master branch was obsolete after the first sprint, so
> I deleted it and re-branched from fperez's. Thus, I suppose that
> comparing against my current master won't be useful to you.
>
> BTW, my fix branches are incomplete (neither tests nor docs have been
> updated), but in the future, how should they be merged (if they should
> be)?  I mean, should datarray fork from the new github numpy into a new
> repository owned by a "datarray" user? I don't know much about how this
> kind of thing is managed on github, but I remember some comments about
> that.
>
> apa!
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth


[Numpy-discussion] datarray repositories have diverged

2010-09-30 Thread Rob Speer
There's some DataArray code that I've had for a while, but I just
finished it up and tested it today. Most notably, it includes a new
__str__ for DataArrays that does some nice layout and includes ticks:

In [9]: print d_arr
country   year
--------- -------------------------------------------------
               1994      1998      2002      2006      2010
Netherlan -0.505758  0.096597  1.083148 -0.450156  0.172754
Uruguay    1.772182 -0.113394 -0.781307  1.002416 -0.64925
Germany   -2.013874  0.283947  1.170848 -0.504823  0.448497
Spain     -0.725844  0.909713 -1.191371 -0.465167 -1.518764

(The layout functions in datarray.print_grid actually work with any
ndarray, so you can use it as an alternative to the __str__ in NumPy.)

However, I notice that all the new development on datarray is
happening on Fernando Perez's branch, which mine diverged from long
ago. I forked from Lluis (jesusabdullah)'s branch, which was the most
active at the time, and I got all but the most recent changes merged
back in. But that branch in turn was never merged back into fperez's.

The divergence point is even before I added the syntax for named axes
and ticks by name (such as arr.named['Spain', 1994:2010] or
arr.year.named[2010]).

Does it make sense to merge my branch into the main one?

http://github.com/rspeer/datarray
-- Rob


Re: [Numpy-discussion] basic question about numpy arrays

2010-08-16 Thread Rob Speer
To explain:

A has shape (2,1), meaning it's a 2-D array with 2 rows and 1 column.
The transpose of A has shape (1,2): it's a 2-D array with 1 row and 2
columns. That's not the same as what you want, which is an array with
shape (2,): a 1-D array with 2 entries.

When you take A[:,0], you're pulling out the 1-D array that
constitutes column 0 of your array, which is exactly what you want.
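
The shapes above can be checked directly. This is a minimal sketch; the array contents are made up, and "A" here is only a stand-in for the array in the original question.

```python
import numpy as np

# Hypothetical stand-in for the array "A" discussed above.
A = np.array([[10.0], [20.0]])

print(A.shape)        # (2, 1): 2 rows, 1 column
print(A.T.shape)      # (1, 2): 1 row, 2 columns, still 2-D
column = A[:, 0]      # the integer index drops the second axis
print(column.shape)   # (2,): a 1-D array with 2 entries
```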

-- Rob


Re: [Numpy-discussion] Datarray BoF, part2

2010-07-21 Thread Rob Speer
I agree with the idea that axis labels must be strings.

Yes, this is the opposite of my position on tick labels ("names"), but
there's a reason: ticks are often defined by whatever data you happen
to be working with, but axis labels will in the vast majority of
situations be defined by the programmer as they're writing the code.
If the programmer wants to name something, they'll certainly be able
to do so with a string.
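
A rough sketch of why the str-only rule keeps axis=2 unambiguous. The helper name resolve_axis is hypothetical and not part of datarray; it only illustrates the rule that integers always mean positions and strings always mean labels.

```python
def resolve_axis(axis, labels):
    """Return the positional index of an axis.

    Hypothetical helper, not part of datarray: integers are always
    positions, strings are always labels, so the two cannot collide.
    """
    if isinstance(axis, int):
        if not -len(labels) <= axis < len(labels):
            raise IndexError("axis %d out of range" % axis)
        return axis % len(labels)
    if isinstance(axis, str):
        return labels.index(axis)   # raises ValueError if the label is unknown
    raise TypeError("axis must be an int position or a str label")


labels = ["country", "year"]
print(resolve_axis(1, labels))        # 1
print(resolve_axis("year", labels))   # 1
print(resolve_axis(-2, labels))       # 0
```

If integer labels were allowed, the first branch could not tell "position 1" from "the axis labeled 1".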

-- Rob

On Wed, Jul 21, 2010 at 2:08 PM, Keith Goodman wrote:
> On Wed, Jul 21, 2010 at 10:58 AM, M Trumpis wrote:
>
>> Separately, regarding the permissible axis labels, I think we must not
>> allow any enumerated axis labels (ie, ints and floats). I don't
>> remember if there was a consensus about that yesterday. We don't have
>> the flexibility in the ndarray API to allow for the expression
>> darr.method(axis=2) to mean not the 2nd dimension, but the Axis with
>> label==2
>
> So the axis label rule could be either:
>
> 1. str only
> 2. Any hashable object except int or float
>
> #1 is looking better and better. Plus you already coded it :)


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-13 Thread Rob Speer
> Not really. 1-D structured arrays can and do work well for the very
> common case where one has unlabeled rows and labeled columns. They are
> also a little bit more flexible in that the columns can be
> heterogeneous in dtype, as columns are wont to do.
>
> May I politely suggest that, just as some people did not do a
> sufficient job of reading the datarray proposal to understand how they
> differ from structured arrays, you do not know enough about
> structured arrays to understand the ways in which they are similar to
> labeled arrays? Understanding both the similarities and differences is
> important because both are going to be living in the same ecosystem
> with overlapping niches.

All right, you got me there. I had no idea that record arrays had that
functionality.

In the end, we are making the same point. Datarrays and record arrays
take different approaches to similar problems, and can easily coexist,
and you could even make a datarray of records if you wanted.
-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-12 Thread Rob Speer
rec['305'] extracts a single value from a single record.
arr.named[:,305] extracts an *entire column* from a 2-D datarray,
returning you a 1-D datarray.

Once again, 1-D record arrays and 2-D labeled arrays look similar when
you print them, but the data structures are so unrelated that there is
really not much point in comparing them any further.
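
To make the structural difference concrete, here is a small sketch in plain NumPy. The field names (u305, u6) and the ratings are invented for illustration, and the 2-D array is an ordinary ndarray rather than a datarray.

```python
import numpy as np

# Hypothetical ratings data: field names and values are made up.
rec = np.array(
    [("Movie A", 4.0, 2.0), ("Movie B", 5.0, 1.0)],
    dtype=[("name", "U20"), ("u305", "f8"), ("u6", "f8")],
)
print(rec.shape)          # (2,): one axis, each entry is a record
print(rec[0]["u305"])     # 4.0, one value from one record

# The same numbers as a plain 2-D array: rows are movies, columns are users.
arr = np.array([[4.0, 2.0], [5.0, 1.0]])
print(arr.shape)          # (2, 2): two axes
print(arr[:, 0])          # [4. 5.], a 1-D column
```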
-- Rob

On Mon, Jul 12, 2010 at 6:01 PM, Neil Crighton wrote:
> Rob Speer writes:
>
>> It's not just about the rows: a 2-D datarray can also index by
>> columns, an operation that has no equivalent in a 1-D array of records
>> like your example.
>
> rec['305'] effectively indexes by column. This is one of the main attractions of
> structured/record arrays.
>


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-12 Thread Rob Speer
It's not just about the rows: a 2-D datarray can also index by
columns, an operation that has no equivalent in a 1-D array of records
like your example.

In the movie example, arr.col_named(305) (or, in datarray syntax,
arr.named[:,305], or arr.user.named[305]) contains the movie ratings
for the user with ID 305, still indexed by movie titles. You can't do
that at all with a record array of the form you described, except by
using a list comprehension over the whole array that turns it into
something else.

2-D datarrays and 1-D record arrays may look similar, but they are
very different data structures. In fact, they're probably orthogonal
to each other -- I see no reason one couldn't make a datarray of
records, except for the fact that I wouldn't want to write the __str__
for such a beast.

(Speaking of which, I'm working on a 2-D datarray __str__ based on the
Divisi one. I have to make it support datatypes besides floats,
however.)
-- Rob

On Sun, Jul 11, 2010 at 2:09 PM, Neil Crighton wrote:
> Robert Kern writes:
>
>>
>> On Sun, Jul 11, 2010 at 11:36, Rob Speer wrote:
>> >> But the utility of named indices is not so clear
>> >> to me. As I understand it, these new arrays will still only be
>> >> able to have a single type of data (one of float, str, int and so
>> >> on). This seems to be pretty limiting.
>>
>> Having ticks on *every* axis is the primary feature there.
>>
>
> I see, thanks.
>
> So for Rob's example slide you could use a record array:
>
> rec = np.rec.fromrecords(data, names='name,305,6,234')
>
> (Here data is a list of tuples, each tuple giving the movie name + its data.)
>
> In this case it's easy to index by field name (rec['305']), but trickier to
> choose the row using the movie name:
>
> ind = dict((n,i) for i,n in enumerate(rec.name))
>
> rec[ind['Wrong Trousers, The (1993)']]
>
> So datarrays would make this easier.


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-11 Thread Rob Speer
> But the utility of named indices is not so clear
> to me. As I understand it, these new arrays will still only be
> able to have a single type of data (one of float, str, int and so
> on). This seems to be pretty limiting.

This just shows that people use NumPy for lots of different things. I
myself have never understood what record arrays are for -- what's the
advantage of using NumPy if your data isn't made of numbers?

If you've ever used a dictionary instead of a list, then you have seen
the utility of named indices. I've got an example of labeled data in a
matrix, starting at slide 10 of
http://csc.media.mit.edu/docs/_static/divisi_slides.pdf.
-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-09 Thread Rob Speer
Keith Goodman wrote:
> I ran into a few more questions while playing with datarrays, so I started a 
> list:
> http://github.com/kwgoodman/datarrayQ

I have quick answers to some of the questions.

> Can I have ticks without labels?
Ideally, yes, but good catch: the current code disallows that for no
good reason.

> Add a ticks input parameter?
I very much approve of this proposal (to add ticks= to define ticks
separately from axes).

> Create Axis._tick_dict when needed?
Wait, the dictionary wouldn't be saved at all? What's the point, then?
Constant-time lookups of tick names are essential, and this proposal
would turn that into linear time.
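
A quick illustration of the difference, using plain Python containers rather than the actual Axis class. The name tick_dict mirrors Axis._tick_dict but is only a local variable here.

```python
ticks = ["Netherlands", "Uruguay", "Germany", "Spain"]

# Built once and kept: tick name -> position along the axis.
tick_dict = {name: i for i, name in enumerate(ticks)}

print(tick_dict["Germany"])     # 2, a single hash lookup
print(ticks.index("Germany"))   # 2, but scans the list on every call
```

Rebuilding the dictionary on each access would cost a full pass over the ticks every time, which is the linear-time behaviour described above.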

> Can we prevent user from messing up a datarray?
No. That's pretty much built into Python: the downstream user can do
anything they want to.
Our job is to make sure that what the user wants to do is use the
datarray correctly. :)

> 0d datarrays?
As 0d datarrays are completely pointless, I'm pretty sure that any
code that creates a 0d datarray is a mistake and should fail early.

> Can axis labels be anything besides None or str?
Possibly. The part of this question I particularly like is accessing
attributes programmatically, using arr.axis[axisname]. That gives
.axis much more of a purpose. (Follow-up question: should we merge
.axis and .axes in the API?)

> Direct access to array?
It's trivial: DataArray is a subclass of ndarray, so a DataArray
already is an ndarray. If you want to strip off all the datarray stuff
anyway (perhaps for efficiency reasons), you can use np.asarray(arr).
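
A small check of that behaviour, assuming a bare ndarray subclass as a stand-in for DataArray (the class name Labeled is hypothetical):

```python
import numpy as np


class Labeled(np.ndarray):
    """Hypothetical minimal ndarray subclass standing in for DataArray."""


arr = np.arange(6.0).reshape(2, 3).view(Labeled)

print(isinstance(arr, np.ndarray))    # True: the subclass already is an ndarray
plain = np.asarray(arr)               # base-class view of the same data
print(type(plain) is np.ndarray)      # True
print(np.shares_memory(arr, plain))   # True: no copy was made
```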

> Support for alignment?
Very yes. Aligning/joining labels is something that basically everyone
who works with labeled data needs to do, so we should figure out the
logic for it and include it in datarray so downstream users don't have
to reinvent it.

> Can labels and ticks be changed?
I'd favor them being immutable, but could have my mind changed by a
good use case for mutating them.

-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-09 Thread Rob Speer
Now, the one part I've implemented that I just made up instead of
looking to the SciPy consensus (because there was no SciPy consensus)
was how to refer to multiple labeled axes without repeating ".axis"
all over the place. My choice, which I call "magical axis attributes",
is to have arr.somelabel == arr.axis.somelabel whenever it doesn't
mean something else. This turns the call
  arr.axis.country.named['Netherlands'].axis.year[-1]
into:
  arr.country.named['Netherlands'].year[-1]

I got a message from Fernando Perez saying that he didn't like the
magical axis attributes, for the expected reason that it's
inconsistent. You shouldn't have to refer to your axis differently
just because you called it something like "mean". Another problem that
just occurred to me is that
datarray-using code could break just because DataArray, or even
ndarray itself, grew a new method.

I like the syntax that magical attributes provide, but I'm willing to
consider other options. Here's one:

The __getattr__ only does its magic on attribute names that end in
"_index" or "_named", which should not conflict with other method
names. "arr.foo_index[3]" is the same as "arr.axis.foo[3]".
Furthermore, "arr.foo_named['bar']" is the same as
"arr.axis.foo.named['bar']". Then the above lookup becomes:
  arr.country_named['Netherlands'].year_index[-1]
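
Here is a toy sketch of how the "_index" / "_named" suffix rule could work. It is not the datarray implementation: the classes Toy and _Indexer are hypothetical, the array is wrapped rather than subclassed, and only single integer positions or single tick names are handled (no slices).

```python
import numpy as np


class _Indexer:
    """Indexes one labeled axis, by position or by tick name."""

    def __init__(self, arr, label, by_name):
        self.arr, self.label, self.by_name = arr, label, by_name

    def __getitem__(self, key):
        if self.by_name:
            # Translate the tick name into a position first.
            key = self.arr.ticks[self.label].index(key)
        return self.arr._take(self.label, key)


class Toy:
    """Toy labeled array; not the datarray implementation."""

    def __init__(self, data, labels, ticks):
        self.data = np.asarray(data)
        self.labels = list(labels)   # one str label per axis
        self.ticks = ticks           # label -> list of tick names

    def _take(self, label, position):
        axis = self.labels.index(label)
        data = np.take(self.data, position, axis=axis)
        labels = [l for l in self.labels if l != label]
        return Toy(data, labels, self.ticks)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # attributes and methods always win over axis labels.
        for suffix in ("_index", "_named"):
            label = name[: -len(suffix)]
            if name.endswith(suffix) and label in self.__dict__.get("labels", ()):
                return _Indexer(self, label, by_name=(suffix == "_named"))
        raise AttributeError(name)


arr = Toy(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    labels=["country", "year"],
    ticks={"country": ["Netherlands", "Spain"], "year": [2002, 2006, 2010]},
)
result = arr.country_named["Netherlands"].year_index[-1]
print(result.data)   # 3.0
```

Because the magic only fires for names with one of the two suffixes, an axis called "mean" would be reached as arr.mean_index or arr.mean_named and would not shadow a method.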

I don't find this as appealing as magical attributes, but perhaps it's
more responsible. I'd like to know what other people think, so let me
summarize and name the existing proposals:

arr.axis.country.named['Netherlands'].axis.year[-1]   # the default option -- works in any case
arr[ arr.aix.country.named['Netherlands'].year[-1] ]  # the "stuple" option
arr.country.named['Netherlands'].year[-1]             # the "magical" option
arr.country_named['Netherlands'].year_index[-1]       # the "semi-magical" option

-- Rob

On Fri, Jul 9, 2010 at 1:39 AM, Rob Speer wrote:
> http://github.com/rspeer/datarray represents my best guess at the
> SciPy BOF consensus. I recently switched the method of accessing named
> ticks from .named() to .named[] based on further discussion here.
>
> My implementation is still missing the case with named ticks but
> positional axes, however. That is, you should be able to use .named
> directly on the top-level datarray without referring to any axis
> labels, to say something like arr.named['Netherlands', 2010], but you
> can't yet.
> -- Rob
>
> On Thu, Jul 8, 2010 at 11:44 PM, Keith Goodman wrote:
>> On Thu, Jul 8, 2010 at 1:20 PM, Fernando Perez wrote:
>>
>>> The consensus at the BoF (not that it means it's set in stone, simply
>>> that there was a good chance for back-and-forth on the topic with many
>>> voices) was that:
>>>
>>> 1. There are valid use cases for 'integer ticks', i.e. integers that
>>> index arbitrarily into an array instead of in 0..N-1 fashion.
>>>
>>> 2. That having plain arr[0] give anything but the first element in arr
>>> would be way too confusing in practice, and likely to cause too many
>>> problems.
>>>
>>> 3. That the best solution to allow integer ticks while retaining
>>> 'normal' indexing semantics for integers would be to have
>>>
>>> arr[int] -> normal indexing
>>> arr.something[int] -> tick-based indexing, where an int can mean anything.
>>
>> Has the Scipy 2010 BOF consensus been implemented in anyone's fork? I
>> don't understand the indexing so I'd like to try it.


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
http://github.com/rspeer/datarray represents my best guess at the
SciPy BOF consensus. I recently switched the method of accessing named
ticks from .named() to .named[] based on further discussion here.

My implementation is still missing the case with named ticks but
positional axes, however. That is, you should be able to use .named
directly on the top-level datarray without referring to any axis
labels, to say something like arr.named['Netherlands', 2010], but you
can't yet.
-- Rob

On Thu, Jul 8, 2010 at 11:44 PM, Keith Goodman wrote:
> On Thu, Jul 8, 2010 at 1:20 PM, Fernando Perez wrote:
>
>> The consensus at the BoF (not that it means it's set in stone, simply
>> that there was a good chance for back-and-forth on the topic with many
>> voices) was that:
>>
>> 1. There are valid use cases for 'integer ticks', i.e. integers that
>> index arbitrarily into an array instead of in 0..N-1 fashion.
>>
>> 2. That having plain arr[0] give anything but the first element in arr
>> would be way too confusing in practice, and likely to cause too many
>> problems.
>>
>> 3. That the best solution to allow integer ticks while retaining
>> 'normal' indexing semantics for integers would be to have
>>
>> arr[int] -> normal indexing
>> arr.something[int] -> tick-based indexing, where an int can mean anything.
>
> Has the Scipy 2010 BOF consensus been implemented in anyone's fork? I
> don't understand the indexing so I'd like to try it.


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
> I think we have to start from the nD case, even if I (and I think most
> users) will tend to think in 2D.  The rest is just going to have to be
> up to developers how they want users to interact with what we, the
> developers, see as axes.  No end-user wants to think about the 6th
> axis of the data, but I don't want to be pegged into rows and columns
> thinking because I don't think it works for the below example.

In a lot of tasks, you can get by with just two axes of data, and it
is intuitive to refer to them as "rows" and "columns". If you have
more than two axes, then ideally you give them particular labels, or
otherwise you can still call them axes[0], axes[1], axes[2], etc.

But the question is, if we have convenient names like this for
matrices, should .row on a 3-d array raise an error, or should it give
you axes[0] even though that's more like a plane than a row?
-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
> 3. That the best solution to allow integer ticks while retaining
> 'normal' indexing semantics for integers would be to have
>
> arr[int] -> normal indexing
> arr.something[int] -> tick-based indexing, where an int can mean anything.

All right, it's clear lots of people like it better this way, so I
made arr.named use square brackets instead of parentheses.

Before:
>>> arr.country.named('Spain').year.named(slice(1994, 2010))
After:
>>> arr.country.named['Spain'].year.named[1994:2010]
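
The practical difference is that square brackets let Python build the slice object itself. A minimal sketch, assuming a hypothetical accessor class Named that maps tick names to positions on one axis. It treats the stop tick as exclusive, like an ordinary slice; the thread does not say how datarray handles the endpoint.

```python
class Named:
    """Hypothetical accessor: maps tick names to positions on one axis."""

    def __init__(self, ticks):
        self.ticks = list(ticks)

    def __call__(self, key):
        # Old style: named(slice(1994, 2010))
        return self[key]

    def __getitem__(self, key):
        # New style: named[1994:2010]; Python builds the slice object.
        if isinstance(key, slice):
            start = None if key.start is None else self.ticks.index(key.start)
            stop = None if key.stop is None else self.ticks.index(key.stop)
            return slice(start, stop, key.step)
        return self.ticks.index(key)


years = [1994, 1998, 2002, 2006, 2010]
named = Named(years)
print(named[1994:2010])           # slice(0, 4, None)
print(named(slice(1994, 2010)))   # slice(0, 4, None)
print(named[2006])                # 3
```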

-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
>> Still, I have a question. Did you also agree that this should forcibly index
>> through ticks?
>>
>>  arr.something[int]      -> tick-based indexing
>>
>
> Yes.

I feel like people are talking about different things because it's
unclear what the .something is.

If the .something is an axis name, then no. arr.year[0] should get the
first year in the data, not the data from the "year 0".

If the .something is the attribute we use for named lookup (such as
".named"), then yes. arr.named[2006] should get whatever tick is named
2006 on the first axis.
-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
> No. I'd rather go for eliminating the 'arr.year.named', and providing only:
>  * arr.__getitem__
>  * arr.named.__getitem__
>  * arr..__getitem__
>
> The first being just the current ndarray.__getitem__, and the two last methods
> would accept both strings and integers, assuming that names/ticks based on
> integers (e.g., the 1994 above) must be provided as strings, or otherwise are
> treated as good old array indexes.

There are lots of data types besides strings that make good names
(tuples, for example).

My impression from SciPy was that people would prefer separate
accessors for names and indices, especially because integers (a really
common data type, after all) shouldn't be forbidden. Also, working
with strings of integers like '2010' makes me feel like I'm using PHP,
a feeling I like to avoid whenever possible. :)

-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
On Thu, Jul 8, 2010 at 2:27 PM, Skipper Seabold wrote:
> On Thu, Jul 8, 2010 at 1:35 PM, Rob Speer wrote:
>> Your labels are unique if you look at them the right way. Here's how I
>> would represent that in a datarray:
>> * axis0 = 'city', ['Austin', 'Boston', ...]
>> * axis1 = 'month', ['January', 'February', ...]
>> * axis2 = 'year', [1980, 1981, ...]
>> * axis3 = 'region', ['Northeast', 'South', ...]
>> * axis4 = 'measurement', ['precipitation', 'temperature']
>>
>> and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
>> axis3, axis4].
>>
>
> Yeah, this is what I was thinking I would have to do, but it's still
> not clear to me (I have trouble trying to think in 5 dimensions...).
> For instance, what axis holds my actual numeric data?
>
> axis4, with a "precipitation" tick?

Yep, that's what I was suggesting. Or you could have two different 4-D
matrices, one whose values are precipitation and one whose values are
temperatures.

>> Now I realize not everyone wants to represent their tabular data as a
>> big tensor that they index every which way, and I think this is one
>> thing that pandas is for.
>
> This is kind of where I would like the divide to be between user and
> developer.  On top of all of this, I would like to see a __repr__ or
> something that actually spits out a 2d spreadsheet-looking
> representation.  It would help me stay sane I think.  Fernando's nice
> 3D graphic only can go so far as a mental model (for me at least).

Divisi2 uses a 2D labeled representation as its __str__ -- an example
is at http://csc.media.mit.edu/docs/divisi2/sparse.html

I could port this onto datarray. I was holding off because I was
unsure about how to represent the N-d case, but I realize now that
showing the entries in this kind of 2-D tabular format could actually
be a really intuitive way to do it.

> Mix-ins sounds reasonable to me as long as this could easily be
> accomplished.  Ie., why use csr?  Can you go between others?  Are the
> sparse matrices reasonably stable given recent activity?  Not
> rhetorical questions, I don't use sparse matrices much.

These are good questions.

I ended up using PySparse instead of scipy.sparse, because SciPy 0.7's
sparse matrices weren't ready to support many important operations,
particularly slicing. SciPy 0.8's sparse matrices look much better,
and I may transition to using them once it's released.

When planning future features of NumPy, of course, we should assume
SciPy's sparse matrices do what we want (and possibly fix them if they
don't).

csr_matrix was just an example. I think there would have to be
separate classes for labeled csr_matrices, labeled lil_matrices, and
so on, supporting all the usual methods for converting between them.
-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
>> But I don't understand your second example:
>>>   arr.country['Spain'].year[1994:2010]
>
>> That seems to run straight into the index/name ambiguity. Shouldn't
>> that take the 1994th through 2010th indices along the "year" axis? Not
>> every axis will have names, so you can't make *all* the indexing go by
>> names.
>
> Sorry, I just c&p without placing the necessary '.
>
>> If named were a getitem-able object, that would be:
> arr.country.named['Spain'].year.named[1994:2010]
>
> Or what I was striving for:
>
>   arr.year.named[1994:2010]
>   arr.year['1994':'2010']
>   arr.year['1994':-3]

So your proposal is, whenever there's an index that is not an integer,
look it up by name, and use .named only if you want integer tick
names? This feels too inconsistent to me. It adds a fair amount of
confusion to save a small amount of typing. If keystrokes are that
important, I'd rather replace "named" with something shorter than lose
the distinction entirely.
-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
> Forgive me if this has already been addressed, but my question is
> what happens when we have more than one "label" (not as in a labeled
> axis but an observation label -- but not a tick because they're not
> unique!) per, say, row axis and heterogeneous dtypes.  This is really the
> problem that I would like to see addressed and from the BoF comments
> I'm not sure this use case is going to be covered.  I'm also not sure
> I expressed myself clearly enough or understood what's already
> available.  For me, this is the single most common use case and most
> of what we are talking about now is just convenient slicing but
> ignoring some basic and prominent concerns.  Please correct me if I'm
> wrong.  I need to play more with DataArray implementation but haven't
> had time yet.
>
> I often have data that looks like this (not really, but it gives the
> idea in a general way I think).
>
> city, month, year, region, precipitation, temperature
> "Austin", "January", 1980, "South", 12.1, 65.4,
> "Austin", "February", 1980, "South", 24.3, 55.4
> "Austin", "March", 1980, "South", 3, 69.1
> 
> "Austin", "December", 2009, 1, 62.1
> "Boston", "January", 1980, "Northeast", 1.5, 19.2
> 
> "Boston","December", 2009, "Northeast", 2.1, 23.5
> ...
> "Memphis","January",1980, "South", 2.1, 35.6
> ...
> "Memphis","December",2009, "South", 1.2, 33.5
> ...

Your labels are unique if you look at them the right way. Here's how I
would represent that in a datarray:
* axis0 = 'city', ['Austin', 'Boston', ...]
* axis1 = 'month', ['January', 'February', ...]
* axis2 = 'year', [1980, 1981, ...]
* axis3 = 'region', ['Northeast', 'South', ...]
* axis4 = 'measurement', ['precipitation', 'temperature']

and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
axis3, axis4].
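
As a rough sketch of that layout in plain NumPy (not datarray itself), using two of the rows quoted above. The variable names and the NaN fill for missing cells are choices made for this illustration.

```python
import numpy as np

# Two of the rows quoted above.
rows = [
    ("Austin", "January", 1980, "South", 12.1, 65.4),
    ("Boston", "January", 1980, "Northeast", 1.5, 19.2),
]

# One (label, ticks) pair per axis, in the order listed above.
axes = [
    ("city", ["Austin", "Boston"]),
    ("month", ["January"]),
    ("year", [1980]),
    ("region", ["Northeast", "South"]),
    ("measurement", ["precipitation", "temperature"]),
]
# One tick -> position dictionary per axis.
lookup = [{tick: i for i, tick in enumerate(ticks)} for _, ticks in axes]

# Start with every cell empty (NaN), then fill in the observed values.
data = np.full([len(ticks) for _, ticks in axes], np.nan)
for city, month, year, region, precip, temp in rows:
    where = tuple(d[t] for d, t in zip(lookup, (city, month, year, region)))
    data[where + (0,)] = precip
    data[where + (1,)] = temp

print(data.shape)                  # (2, 1, 1, 2, 2)
print(int(np.isnan(data).sum()))   # 4 of the 8 cells are empty
```

Even with two rows, half the cells are empty, because each city only ever appears with one region.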

Now I realize not everyone wants to represent their tabular data as a
big tensor that they index every which way, and I think this is one
thing that pandas is for.

Oh, and the other problem with the 5-D datarray is that you'd probably
want it to be sparse. This is another discussion worth having.

I want to eventually replace the labeling stuff in Divisi with
datarray, but sparse matrices are largely the point of using Divisi.
So how do we make a sparse datarray?

One answer would be to have datarray be a wrapper that encapsulates
any sufficiently matrix-like type. This is approximately what I did in
the now-obsolete Divisi1. Nobody liked the fact that you had to wrap
and unwrap your arrays to accomplish anything that we hadn't thought
of in writing Divisi. I would not recommend this route.

The other option, which is more like Divisi2, would be to provide the
functionality of datarray using a mixin. Then a standard dense
datarray could inherit from (np.ndarray, Datarray), while a sparse
datarray could inherit from (sparse.csr_matrix, Datarray), for
example.
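
As a rough sketch of the mixin idea (not Divisi2 or datarray code): the
labeling logic lives in one class that only assumes `obj[i, j]` works,
and the storage class is chosen separately. To keep the example runnable
without NumPy or SciPy, `DenseGrid` and `DictSparse` are hypothetical
stand-ins for np.ndarray and sparse.csr_matrix, and `LabelMixin`,
`set_labels` and `entry_named` are invented names. The Boston/February
value is a placeholder.

```python
class LabelMixin:
    """Hypothetical mixin: name-based lookup for anything indexable as obj[i, j]."""

    def set_labels(self, row_names, col_names):
        self._row_pos = {name: i for i, name in enumerate(row_names)}
        self._col_pos = {name: j for j, name in enumerate(col_names)}
        return self

    def entry_named(self, row_name, col_name):
        # Translate names to positions, then defer to the storage class.
        return self[self._row_pos[row_name], self._col_pos[col_name]]


class DenseGrid:
    """Stand-in for a dense array: stores every cell."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    def __getitem__(self, key):
        i, j = key
        return self._rows[i][j]


class DictSparse:
    """Stand-in for a sparse matrix: stores only the nonzero cells."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def __getitem__(self, key):
        return self._entries.get(key, 0.0)


class DenseLabeled(DenseGrid, LabelMixin):
    pass


class SparseLabeled(DictSparse, LabelMixin):
    pass


cities = ["Austin", "Boston"]
months = ["January", "February"]
dense = DenseLabeled([[12.1, 24.3], [1.5, 0.0]]).set_labels(cities, months)
sparse = SparseLabeled({(0, 0): 12.1, (0, 1): 24.3, (1, 0): 1.5}).set_labels(cities, months)
```

Both objects answer `entry_named("Boston", "January")` the same way, and
neither needs to be wrapped or unwrapped to reach the storage class's own
methods.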

-- Rob
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
> While I haven't had a chance to really look in-depth at the changes
> myself (I'm a busy man! So many mailing lists!), I so far like the
> look and sound of them. That's just my opinion, though.

If people are okay with the attribute magic, I have a proposal for more of it.

In my own project where I use labeled arrays
(http://github.com/commonsense/divisi2), I don't have labeled axes.
But I assumed everything was 1-D or 2-D, and gave the 2-D matrices
methods like "row_named", "col_named", etc., to encourage readable
code.

With the current implementation of datarray, I could get that by
labeling the axes "row" and "col", except the moment you transpose a
matrix like that you get rows named "col" and columns named "row", so
that's not the right answer.

My proposal is that datarray.row should be equivalent to
datarray.axes[0], and datarray.column should be equivalent to
datarray.axes[1], so that you can always ask for something like
"arr.column.named(2010)" (replace those with square brackets if you
like).
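
Here is a small pure-Python sketch of what I mean, assuming a toy 2-D
class rather than the real datarray (`Labeled2D`, its `transpose` method
and the sample numbers are all made up). The point is that `row` and
`column` follow position, while the axis labels travel with the data.

```python
class Axis:
    def __init__(self, parent, position, label, ticks):
        self.parent = parent
        self.position = position  # 0 = rows, 1 = columns
        self.label = label
        self.ticks = list(ticks)

    def named(self, name):
        """Return the row or column whose tick has this name."""
        i = self.ticks.index(name)
        if self.position == 0:
            return list(self.parent.data[i])
        return [row[i] for row in self.parent.data]


class Labeled2D:
    def __init__(self, data, labels, ticks):
        self.data = [list(r) for r in data]
        self.axes = [Axis(self, k, labels[k], ticks[k]) for k in range(2)]

    @property
    def row(self):
        return self.axes[0]  # always the first axis, whatever its label

    @property
    def column(self):
        return self.axes[1]  # always the second axis, whatever its label

    def transpose(self):
        swapped = list(reversed(self.axes))
        return Labeled2D(
            zip(*self.data),
            [axis.label for axis in swapped],
            [axis.ticks for axis in swapped],
        )


arr = Labeled2D(
    [[1, 2], [3, 4]],
    ["country", "year"],
    [["Netherlands", "Spain"], [2006, 2010]],
)
by_year = arr.column.named(2010)
flipped = arr.transpose()
```

After the transpose, `flipped.row` is the "year" axis and
`flipped.column` is the "country" axis, so nothing ends up mislabeled.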

Not sure yet what the right way is to generalize this to 1-D and n-D.
-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-08 Thread Rob Speer
On Thu, Jul 8, 2010 at 7:13 AM, Lluís  wrote:
> Thus, we can use something in the middle:
>
>   arr[0,1]
>   arr.names['Netherlands',2010] # I'd rather go for 'names' instead of 'ticks'

Ah ha. So this is the case with positional axes but named ticks, which
we haven't really brought up yet. I'm definitely thinking of making
the top-level datarray support "named" as well, which would make it
into:
>>> arr.named('Netherlands', 2010)

But the other change you've got here is to make "named" into a
__getitem__-able object instead of a method, so you use square
brackets with it and can use slice syntax. I could do it this way as
well.

But I don't understand your second example:
>   arr.country['Spain'].year[1994:2010]

That seems to run straight into the index/name ambiguity. Shouldn't
that take the 1994th through 2010th indices along the "year" axis? Not
every axis will have names, so you can't make *all* the indexing go by
names.

If named were a getitem-able object, that would be:
>>> arr.country.named['Spain'].year.named[1994:2010]
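
A minimal sketch of how that could work on a single axis, assuming a toy
`Axis` class rather than the real datarray code. `NameIndexer` is a
made-up helper, and I'm assuming a slice of names excludes its stop name,
the way a positional slice does; that choice is still open.

```python
class NameIndexer:
    """Hypothetical object behind axis.named: looks ticks up by name."""

    def __init__(self, axis):
        self._axis = axis

    def _position(self, name):
        return None if name is None else self._axis.ticks.index(name)

    def __getitem__(self, key):
        # Inside square brackets, 1994:2010 arrives as slice(1994, 2010, None).
        if isinstance(key, slice):
            # Any step is ignored in this sketch.
            return self._axis[self._position(key.start):self._position(key.stop)]
        return self._axis[self._position(key)]

    # Keeping the call form means named(2010) still works too.
    __call__ = __getitem__


class Axis:
    def __init__(self, ticks, values):
        self.ticks = list(ticks)
        self.values = list(values)
        self.named = NameIndexer(self)

    def __getitem__(self, key):
        # Plain brackets always mean positions.
        return self.values[key]


# Placeholder values, one per tick.
year = Axis([1994, 1998, 2002, 2006, 2010], [10, 20, 30, 40, 50])
by_name = year.named[1994:2010]
by_index = year[1994:2010]
```

Here `by_name` picks out the values for 1994 through 2006, while
`by_index` is empty because there is no 1994th position.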

-- Rob


Re: [Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

2010-07-07 Thread Rob Speer
Glad I finally found this discussion.

I implemented some of the ideas from the SciPy BOF discussion, and
Joshua has already merged them into his datarray on GitHub (thanks,
Joshua, for being so fast on the merge button).

To introduce these changes, here are a couple of examples of how you
could index into a matrix whose rows represent countries, and whose
columns represent something that is observed every four years
(hmm...).
>>> arr.country.named('Netherlands').year.named(2010)
>>> arr.country.named('Spain').year.named(slice(1994, 2010))
>>> arr.year.named(2006).country[0:2]

First of all, a bit of terminology. Axes can have labels. Ticks (which
are particular rows, columns, etc.) can have names. Axes and ticks
also have indices (the sequential numbers they've always had). Feel
free to suggest alternate terminology; I just used what sounded the
most natural to me in the method names.

Addressing by indices and addressing by tick names are separate, which
allows integers to be tick names without a conflict. You use the
"named" method of an axis to address it by name, while __getitem__
only addresses it by indices. You can still take slices of names
(makes sense for things like years), but you have to spell out "slice"
because it's not inside square brackets.
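
As a toy illustration of that separation (a stand-alone sketch, not the
datarray implementation; the `Axis` class and its values are invented,
and I'm assuming a slice of names stops before its end name):

```python
class Axis:
    def __init__(self, ticks, values):
        self.ticks = list(ticks)
        self.values = list(values)

    def __getitem__(self, key):
        # Square brackets only ever see indices.
        return self.values[key]

    def named(self, name):
        # The method only ever sees tick names (or a slice of them).
        if isinstance(name, slice):
            start = None if name.start is None else self.ticks.index(name.start)
            stop = None if name.stop is None else self.ticks.index(name.stop)
            return self.values[start:stop]
        return self.values[self.ticks.index(name)]


year = Axis([1994, 1998, 2002, 2006, 2010], [10, 20, 30, 40, 50])

# Integer tick names that would collide with indices if the two were mixed.
rank = Axis([2, 1, 0], ["a", "b", "c"])
```

`rank[0]` is the first entry, "a", while `rank.named(0)` is the entry
whose tick is called 0, "c". Likewise `year[0:2]` takes two positions and
`year.named(slice(1994, 2010))` takes a range of years.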

Then, at the axis level: My impression from the SciPy discussion was
that people wanted to be able to look up multiple labeled axes at once
without repeating themselves, and .aix and stuples were not
satisfying, but we didn't come up with anything else during the
discussion.

My choice was to add a bit of attribute magic: if you get an attribute
of a datarray that is (a) not a real attribute and (b) matches the
label of one of its axes, you'll get that axis. So "arr.axis.country"
can be shortened to "arr.country", for example, but if you decided to
name your axis "T", you would be stuck with "arr.axis.T".
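
Roughly, the lookup works like this. This is a simplified sketch, not
the code in the repository: `Labeled`, `AxisNamespace` and the stub `T`
property are made up to show the fallback order. It relies on Python
calling `__getattr__` only after normal attribute lookup has failed.

```python
class Axis:
    def __init__(self, label, ticks):
        self.label = label
        self.ticks = list(ticks)


class AxisNamespace:
    """Hypothetical stand-in for arr.axis: every label is reachable here."""

    def __init__(self, axes):
        self.__dict__["_by_label"] = {axis.label: axis for axis in axes}

    def __getattr__(self, label):
        try:
            return self.__dict__["_by_label"][label]
        except KeyError:
            raise AttributeError(label) from None


class Labeled:
    def __init__(self, axes):
        self.axis = AxisNamespace(axes)

    @property
    def T(self):
        # A real attribute, standing in for the array transpose.
        return "transposed array"

    def __getattr__(self, name):
        # Reached only when `name` is not a real attribute of the array.
        axis = self.__dict__.get("axis")
        if axis is None:
            raise AttributeError(name)
        return getattr(axis, name)


arr = Labeled([Axis("country", ["Netherlands", "Spain"]), Axis("T", [0, 1])])
```

So `arr.country` and `arr.axis.country` are the same object, `arr.T` is
still the real attribute, and the axis labeled "T" is only reachable as
`arr.axis.T`.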

So this is the state of the code at http://github.com/rspeer/datarray
(and also at http://github.com/jesusabdullah/datarray now). I'll even
try to make the documentation catch up with this code if people think
the changes are good.
-- Rob