Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

Matthew Brett Wed, 06 Jul 2011 11:03:31 -0700

Hi,

On Wed, Jul 6, 2011 at 6:54 PM, Christopher Jordan-Squire
<cjord...@uw.edu> wrote:
>
>
> On Wed, Jul 6, 2011 at 5:05 AM, Matthew Brett <matthew.br...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Just for reference, I am using this as the latest version of the NEP -
>> I hope it's current:
>>
>>
>> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>>
>> I'm mostly relaying stuff I said, although generally (please do
>> correct me if I am wrong) I am just re-expressing points that
>> Nathaniel has already made in the alterNEP text and the emails.
>>
>> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
>> <cjord...@uw.edu> wrote:
>> ...
>> > Since Mark is only around Austin until early August, there's
>> > also broad agreement that we need to get something done quickly.
>>
>> I think I might have missed that part of the discussion :)
>>
>
> I think that might have been mentioned by Travis right before he had to
> leave for another meeting, which might have been after you'd disconnected.
> Travis' concern as a member of the numpy community is the desire for something
> that is broadly applicable and adopted. But as Mark's employer, his concern
> is to get a more complete and coherent missing data functionality
> implemented in numpy while Mark is still at Enthought, for use in the
> problems Enthought and statisticians commonly encounter if nothing else.


Sorry - yes - I wasn't there for all the conversation.  Of course
(not disagreeing), we must take care to get the API right because it's
unlikely to change and we will be explaining and supporting it for a long
time to come.

>> I feel the need to emphasize the centrality of the assertion by
>> Nathaniel, and agreement by (at least) me, that the NA case (there
>> really is no data) and the IGNORE case (there is data but I'm
>> concealing it from you) are conceptually different, and come from
>> different use-cases.
>>
>> The underlying disagreement returned many times to this fundamental
>> difference between the NEP and alterNEP:
>>
>> In the NEP - by design - it is impossible to distinguish between na.NA
>> and na.IGNORE
>> The alterNEP insists you should be able to distinguish.
>>
>> Mark says something like "it's all missing data, there's no reason you
>> should want to distinguish".  Nathaniel and I were saying "the two
>> types of missing do have different use-cases, and it should be
>> possible to distinguish.  You might want to choose to treat them the
>> same, but you should be able to see what they are.".
>>
>> I returned several times to this (original point by Nathaniel):
>>
>> a[3] = np.NA
>>
>> (what does this mean?   Am I altering the underlying array, or a mask?
>>  How would I explain this to someone?)
>>
>> We confirmed that, in order to make it difficult to know what your NA
>> is (masked or bit-pattern), Mark has to a) hinder access to the data
>> below the mask and b) prevent direct API access to the masking array.
>> I described this as 'hobbling the API' and Mark thought of it as
>> 'generic programming' (missing is always missing).
>>
>> I asserted that explaining NA to people would be easier if ``a[3] =
>> np.NA`` was direct assignment and altered the array.
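
To make the question concrete, here is a minimal sketch of the two possible meanings of that assignment. np.NA is only proposed in the NEP, so this uses stand-ins that exist today: numpy.ma for the masked reading and a float NaN for the bit-pattern reading. Neither is the proposed API; the variable names are mine.

```python
import numpy as np

# Masked stand-in: the assignment flips a mask flag and the stored
# value underneath is left alone.
masked = np.ma.array([10.0, 11.0, 12.0, 13.0])
masked[3] = np.ma.masked
value_under_mask = masked.data[3]
mask_flags = np.ma.getmaskarray(masked).tolist()

# Bit-pattern stand-in (assumption: NaN plays the role of the NA
# bit pattern): the assignment overwrites the stored value itself.
direct = np.array([10.0, 11.0, 12.0, 13.0])
direct[3] = np.nan
value_after_overwrite = direct[3]
```

In the first case the 13.0 can still be recovered from the data buffer; in the second it is gone. That is the difference a user would need explained.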
>>
>> > BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
>> >
>> > ------------------------------------------------------------------------------------------
>> > The current NEP proposes both mask and bit pattern implementations for
>> > missing data. I use the terms bit pattern and parameterized dtype
>> > interchangeably, since the parameterized dtype will use a bit pattern
>> > for
>> > its implementation. The two implementations will support the same
>> > functionality with respect to NA, and the implementation details will be
>> > largely invisible to the user. Their differences are in the 'extra'
>> > features
>> > each supports.
>> >
>> > Two common questions were:
>> > 1. Why make two implementations of missing data: one with masks and the
>> > other with parameterized dtypes?
>> > 2. Why does the implementation using masks have higher priority?
>> > The answers are:
>> > 1.  The mask implementation is more general and easier to implement and
>> > maintain.  The bit pattern implementation saves memory, makes
>> > interoperability easier, and makes ABI (Application Binary Interface)
>> > compatibility easier. Since each has different strengths, the argument
>> > is
>> > both should be implemented.
>> > 2. The implementation for the parameterized dtypes will rely on the
>> > implementation using a mask.
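
As a rough illustration of the memory point in answer 1, here is a sketch that assumes a one-byte-per-element boolean mask, which is what numpy.ma uses today. The mask layout in the NEP may differ, so treat the numbers as indicative only.

```python
import numpy as np

n = 1000
data = np.zeros(n, dtype=np.float64)

# Mask approach: one extra boolean (1 byte) per element next to the data.
masked = np.ma.array(data, mask=np.zeros(n, dtype=bool))
mask_total = masked.data.nbytes + np.ma.getmaskarray(masked).nbytes

# Bit-pattern approach: missingness is encoded inside each element,
# so there is no storage beyond the data buffer.
bitpattern_total = data.nbytes

print(mask_total, bitpattern_total)
```

For float64 the mask adds 1 byte to every 8 bytes of data; for smaller dtypes the relative overhead is larger.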
>> >
>> > NA VS. IGNORE
>> > ---------------------------------
>> > A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in
>> > the aNEP sense and NA in the NEP sense. With NA, there is a clear notion
>> > of how NA propagates through all basic numpy operations (e.g., 3+NA=NA
>> > and log(NA) = NA, while NA | True = True). IGNORE is separate from NA,
>> > with different interpretations depending on the use case.
>> > IGNORE could mean:
>> > 1. Data that is being temporarily ignored. e.g., a possible outlier that
>> > is
>> > temporarily being removed from consideration.
>> > 2. Data that cannot exist. e.g., a matrix representing a grid of water
>> > depths for a lake. Since the lake isn't square, some entries will
>> > represent
>> > land, and so depth will be a meaningless concept for those entries.
>> > 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE],
>> > [IGNORE,
>> > 3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though
>> > this
>> > leaves open how [1, 2, IGNORE] + [3 , 4] should behave.
>> > Because of these different uses of IGNORE, it doesn't have as clear a
>> > theoretical interpretation as NA. (For instance, what is IGNORE+3,
>> > IGNORE*3,
>> > or IGNORE | True?)
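
The NA propagation rules quoted in this section (3+NA=NA, log(NA) = NA, NA | True = True) can be sketched in plain Python. This is a toy model only: `_NAType`, `NA` and the `log` helper are hypothetical names for illustration, not the NEP's implementation.

```python
import math


class _NAType:
    """Toy 'value exists but is unknown' marker (hypothetical)."""

    def __repr__(self):
        return "NA"

    # Arithmetic with an unknown value is unknown.
    def __add__(self, other):
        return self

    __radd__ = __add__

    # Logical OR: True wins whatever the unknown value is.
    def __or__(self, other):
        return True if other is True else self

    __ror__ = __or__

    # Logical AND: False wins whatever the unknown value is.
    def __and__(self, other):
        return False if other is False else self

    __rand__ = __and__


NA = _NAType()


def log(x):
    # Functions of an unknown value are unknown.
    return NA if x is NA else math.log(x)
```

With this model, `3 + NA` and `log(NA)` both give `NA`, while `NA | True` gives `True` because the answer does not depend on the missing value. No comparably obvious rule falls out for IGNORE.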
>>
>> I don't remember this bit of the discussion, but I see from current
>> masked arrays that IGNORE is treated as the identity, so:
>>
>> IGNORE + 3 = 3
>> IGNORE * 3 = 3
>>
>
> I'd mentioned at the top of my summary that some of the concrete examples
> weren't talked about, even though the ideas were. And the fact that IGNORE
> doesn't have a computational model behind it was mentioned briefly, though
> it wasn't expanded on.
> If we follow those rules for IGNORE for all computations, we sometimes get
> some weird output. For example:
> [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix multiply
> and not * with broadcasting.) Or should that sort of operation throw an
> error?
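
For reference, the [ 15, 31 ] figure can be reproduced with a short sketch. It assumes the identity rule is applied to each product inside the matrix multiply (x * IGNORE = x) and represents IGNORE with a separate boolean array; both the representation and the names are my own, not anything from the NEP or alterNEP.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = np.array([0, 7])                 # the 0 is only a placeholder
ignored = np.array([True, False])    # True marks the IGNORE entry of v

# Identity rule: where v[j] is IGNORE, A[i, j] * IGNORE is just A[i, j].
products = np.where(ignored, A, A * v)

# Sum each row of products, as matrix multiply would.
result = products.sum(axis=1)
print(result.tolist())
```

Row by row this is 1 + 2*7 = 15 and 3 + 4*7 = 31, which shows how the ignored entry ends up contributing the matrix element itself.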

I'm sorry to say that I haven't thought about ignore semantics very
much!  What does masked array do?
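
A quick check against numpy.ma, as a sketch of current behaviour rather than a statement about either proposal:

```python
import numpy as np

a = np.ma.array([1, 2, 3], mask=[True, False, False])

# Elementwise arithmetic: the masked entry stays masked in the result.
shifted = a + 3
shifted_mask = np.ma.getmaskarray(shifted).tolist()
shifted_visible = shifted.compressed().tolist()

# Reductions: masked entries are skipped.
total = a.sum()
product = a.prod()
```

So elementwise, masked + 3 stays masked, and it is only in reductions such as sum and prod that a masked entry behaves like the identity.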

See you,

Matthew
