On Sat, Jun 25, 2011 at 10:25 AM, Charles R Harris
<charlesr.har...@gmail.com> wrote:
>
>
> On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris
>> <charlesr.har...@gmail.com> wrote:
>> >
>> >
>> > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney <wesmck...@gmail.com>
>> > wrote:
>> >>
>> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith <n...@pobox.com>
>> >> wrote:
>> >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root <ben.r...@ou.edu>
>> >> > wrote:
>> >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith <n...@pobox.com>
>> >> >> wrote:
>> >> >>> This is a situation where I would just... use an array and a mask,
>> >> >>> rather than a masked array. Then lots of things -- changing fill
>> >> >>> values, temporarily masking/unmasking things, etc. -- come from
>> >> >>> free,
>> >> >>> just from knowing how arrays and boolean indexing work?
>> >> >>
>> >> >> With a masked array, it is "for free".  Why re-invent the wheel?  It
>> >> >> has
>> >> >> already been done for me.
>> >> >
>> >> > But it's not for free at all. It's an additional concept that has to
>> >> > be maintained, documented, and learned (with the last cost, which is
>> >> > multiplied by the number of users, being by far the greatest). It's
>> >> > not reinventing the wheel, it's saying hey, I have wheels and axles,
>> >> > but what I really need the library to provide is a wheel+axle
>> >> > assembly!
>> >>
>> >> You're communicating my argument better than I am.
>> >>
>> >> >>> Do we really get much advantage by building all these complex
>> >> >>> operations in? I worry that we're trying to anticipate and write
>> >> >>> code
>> >> >>> for every situation that users find themselves in, instead of just
>> >> >>> giving them some simple, orthogonal tools.
>> >> >>>
>> >> >>
>> >> >> This is the danger, which is why I advocate retaining the
>> >> >> MaskedArray type to provide the high-level "intelligent"
>> >> >> operations, while having in the core the basic data structures
>> >> >> for pairing a mask with an array, and recognizing a special
>> >> >> np.NA value that would act upon the mask rather than the
>> >> >> underlying data.  Users would get very basic functionality,
>> >> >> while the MaskedArray would continue to provide the interface
>> >> >> that we are used to.
>> >> >
>> >> > The interface as described is quite different... in particular, all
>> >> > aggregate operations would change their behavior.
>> >> >
>> >> >>> As a corollary, I worry that learning and keeping track of how
>> >> >>> masked arrays work is more hassle than just ignoring them and
>> >> >>> writing the necessary code by hand as needed. Certainly I can
>> >> >>> imagine that *if the mask is a property of the data* then it's
>> >> >>> useful to have tools to keep it aligned with the data through
>> >> >>> indexing and such. But some of these other things are quicker
>> >> >>> to reimplement than to look up the docs for, and the
>> >> >>> reimplementation is easier to read, at least for me...
>> >> >>
>> >> >> What you are advocating is similar to the tried-and-true coding
>> >> >> practice of Matlab users of using NaNs.  You will hear from
>> >> >> Matlab programmers about how it is the greatest idea since
>> >> >> sliced bread (and I was one of them).  Then I was introduced to
>> >> >> Numpy, and while I do sometimes still use the NaN approach, I
>> >> >> realized that the masked array is a "better" way.
>> >> >
>> >> > Hey, no need to go around calling people Matlab programmers, you
>> >> > might
>> >> > hurt someone's feelings.
>> >> >
>> >> > But seriously, my argument is that every abstraction and new concept
>> >> > has a cost, and I'm dubious that the full masked array abstraction
>> >> > carries its weight and justifies this cost, because it's highly
>> >> > redundant with existing abstractions. That has nothing to do with how
>> >> > tried-and-true anything is.
>> >>
>> >> +1. I think I will personally only be happy if "masked array" can be
>> >> implemented while incurring near-zero cost from the end user
>> >> perspective. If what we end up with is a faster implementation of
>> >> numpy.ma in C, I'm probably going to keep on using NaN... That's why
>> >> I'm entirely insistent that whatever design be dogfooded on non-expert
>> >> users. If it's very much harder / trickier / more nuanced than R,
>> >> you will have failed.
>> >>
>> >
> This sounds unduly pessimistic to me. It's one thing to suggest
> different approaches, another to cry doom and threaten to go eat
> worms. And all before the code is written, benchmarks run, or trial
> made of the usefulness of the approach. Let us see how things look as
> they get worked out. Mark has a good track record for innovative
> tools and I'm rather curious myself to see what the result is.
>> >
>> > Chuck
>> >
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> >
>>
>> I hope you're right. So far it seems that anyone who has spent real
>> time with R (e.g. myself, Nathaniel) has expressed serious concerns
>> about the masked approach. And we got into this discussion at the Data
>> Array summit in Austin last month because we're trying to make Python
>> more competitive with R vis-à-vis statistical and financial applications.
>> I'm just trying to be (R)ealistic =P Remember that I very earnestly am
>> doing everything I can these days to make scientific Python more
>> successful in finance and statistics. One big difference with R's
>> approach is that we care more about performance than the R community
>> does. So maybe having special NA values will be prohibitive for that
>> reason.
>>
>> Mark indeed has a fantastic track record and I've been extremely
>> impressed with his NumPy work, so I've no doubt he'll do a good job. I
>> just hope that you don't push aside my input-- my opinions are formed
>> entirely based on my domain experience.
>>
>
> I think what we really need to see are the use cases and work flow. The ones
> that hadn't occurred to me before were memory-mapped files and data stored
> on disk in general. I think we may need some standard format for masked data
> on disk if we don't go the NA value route.
>
> Chuck
>
>

Here are some things I can think of that would be affected by any changes here:

1) Right now users of pandas can type pandas.isnull(series[5]) and
that will yield True if the value is NA for any dtype. This might be
hard to support in the masked regime.
2) Functions like {Series, DataFrame}.fillna would hopefully look just
like this:

# value is 0 or some other value to fill
new_series = self.copy()
new_series[isnull(new_series)] = value

Keep in mind that people will write custom NA handling logic. So they might do:

series[isnull(other_series) & isnull(other_series2)] = val
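A minimal runnable sketch of the two boolean-indexing patterns above, using
NaN as the NA sentinel on plain NumPy arrays. The isnull helper here is a
hypothetical stand-in for pandas.isnull and only makes sense for float data:

```python
import numpy as np

def isnull(arr):
    # Hypothetical stand-in for pandas.isnull: NaN-based, so it is
    # only meaningful for floating-point dtypes.
    return np.isnan(arr)

def fillna(series, value):
    # Copy first so the caller's array is untouched, then overwrite
    # the null positions via boolean indexing.
    new_series = series.copy()
    new_series[isnull(new_series)] = value
    return new_series

series = np.array([1.0, np.nan, 3.0, np.nan])
print(fillna(series, 0.0))  # [1. 0. 3. 0.]

# Custom NA handling: assign where BOTH other arrays are null.
other_series = np.array([np.nan, 2.0, np.nan, 4.0])
other_series2 = np.array([np.nan, np.nan, 3.0, 4.0])
series[isnull(other_series) & isnull(other_series2)] = -1.0
# only position 0 is null in both, so only series[0] becomes -1.0
```

Note that everything here falls out of ordinary boolean indexing; no masked
machinery is involved.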

3) Nulling / NA-ing out data is very common

# null out this data up to and including date1 in these three columns
frame.ix[:date1, [col1, col2, col3]] = NaN

# But this should work fine too
frame.ix[:date1, [col1, col2, col3]] = 0
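A hedged sketch of the nulling pattern above, with a plain 2-D NumPy array
standing in for the DataFrame: integer row positions play the role of dates
and integer column positions the role of col1/col2/col3 (these stand-ins are
illustrative, not pandas API):

```python
import numpy as np

# A 4x3 float block standing in for `frame`.
frame = np.arange(12, dtype=float).reshape(4, 3)

# "Null out" everything up to and including row 1 in columns 0 and 2
# by writing the NA sentinel (NaN) straight into the data...
frame[:2, [0, 2]] = np.nan

# ...and the very same assignment syntax works for a regular value.
frame[:2, [0, 2]] = 0.0
print(frame)
```

The point is that nulling and ordinary assignment share one syntax; with a
separate mask, the two operations would go through different code paths.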

I'll try to think of some others. The main thing is that the NA value
is very easy to think about and fits in naturally with how people (at
least statistical / financial users) think about and work with data.
If you have to say "I have to set these mask locations to True", it
introduces additional mental effort compared with "I'll just set these
values to NA".
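To make the contrast concrete, here is a hedged side-by-side sketch of the
two mental models on a plain NumPy array, with np.nan standing in for the
proposed NA value (which does not exist in NumPy today):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])

# Mask-based model: carry a separate boolean array and "set these
# mask locations to True" to mark values as missing.
mask = np.zeros(data.shape, dtype=bool)
mask[1] = True
masked_mean = data[~mask].mean()  # aggregate over unmasked values

# Value-based model: "just set these values to NA" by writing the
# sentinel into the data itself; no second array to keep aligned.
data[1] = np.nan
na_mean = np.nansum(data) / np.sum(~np.isnan(data))

print(masked_mean, na_mean)  # both 2.0
```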
