On Tue, Jun 28, 2011 at 7:34 AM, Lluís <xscr...@gmx.net> wrote:

> Mark Wiebe writes:
> > The design that's forming is a combination of:
> >
> > * Solve the missing data problem
> > * My ideas of what a good solution looks like:
> >   * applies to all NumPy dtypes in a fully general way
> >   * high-performance, low overhead where possible
> >   * makes the C-level implementation of NumPy nicer to work with, not
> >     harder
> >   * easy to use from Python for unskilled programmers
> >   * easy to use more powerful functionality from Python for skilled
> >     programmers
> >   * satisfies all or most of the needs of the many users of arrays with
> >     a "missing data" aspect to them
>
> I would add here an efficient mechanism to reinterpret existing data with
> different missing information (no copies of the backing array).
>
> Although I'm not sure whether this requires first-class citizenship or
> not.
>
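A minimal sketch of the no-copy reinterpretation suggested above. It uses the existing `numpy.ma` module as a stand-in, which is an assumption: the design in the NEP is still forming, so this is not its API. Two different masks are layered over one backing buffer, and the variable names are illustrative only.

```python
import numpy as np
import numpy.ma as ma

# One backing buffer; nothing below copies it.
data = np.array([1.0, 2.0, 3.0, 4.0])

# Two different "missing" interpretations of the same buffer.
# numpy.ma stands in for the masks discussed in the NEP (assumption).
# ma.array does not copy an ndarray input by default.
view_a = ma.array(data, mask=[False, True, False, False])
view_b = ma.array(data, mask=[True, False, False, True])

# Both views share memory with the original buffer.
shares_a = np.shares_memory(view_a.data, data)
shares_b = np.shares_memory(view_b.data, data)

# Reductions respect each view's own mask.
sum_a = view_a.sum()  # 1 + 3 + 4
sum_b = view_b.sum()  # 2 + 3

# The value hidden by view_a's mask is still in the backing buffer.
hidden_value = data[1]
```

Only the small boolean masks differ between the two views, so the cost of a new interpretation is one mask, not a second data array.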
I'm calling this idea "masking semantics" generally.

> > * All the feedback I'm getting from discussions on the list
> [...]
> > I've updated a section "Parameterized Data Type With NA Signal Values"
> > in the NEP with an idea for how an NA bit pattern approach could
> > coexist and work together with the mask-based approach. I think I've
> > solved some of the generality and implementation obstacles, it would
> > be great to get some feedback on that.
>
> Some (obvious) thoughts about it:
>
> * Trivial to store, as the missing property is encoded in the value
>   itself.
> * Third-party (non-Python) code needs some interface to interpret these
>   without having to know the implementation details (although the
>   interface is rather trivial).
> * Data marked as missing loses its original value.
> * Reinterpreting the same data (memory buffer) with different missing
>   information requires either memory copies or separate mask arrays (see
>   above)
>
> So, while it (data types with NA signal values) has its advantages on a
> simpler interaction with 3rd party code and during long-term storage,
> masks will still be needed.
>
> I think that deciding on the value of NA signal values boils down to
> this question: should 3rd party code be able to interpret missing data
> information stored in the separate mask array?
>

I'm tossing around some variations of ideas using the iterator to provide a
buffered mask-based interface that works uniformly with both masked arrays
and NA dtypes. This way 3rd party C code only needs to implement one missing
data mechanism to fully support both of NumPy's missing data mechanisms.

-Mark

> If the answer is no, then 3rd party code should be given a copy of the
> data where the masked array is merged with the ndarray data buffer
> (assuming the original ndarray had a masked array before passing it to
> the 3rd party code). As by definition (?)
> the ndarray with a mask must
> retain the original data, the result of the 3rd party code must be
> translated back into an ndarray + mask.
>
> If the answer is yes, then I think the NA signal values just add
> unnecessary complexity, as the 3rd party code will already need to use
> some numpy-specific API to handle missing data through the ndarray
> buffer + mask buffer. This reminds me that if 3rd party were to use the
> new iterator interface, the interface could be twisted in a way that it
> returns only the non-missing parts. For the sake of performance, this
> could be optional, so that the default behaviour is to just iterate
> through non-missing data but an option can be used to iterate over all
> data, and leave missing data handling up to the 3rd party code.
>
>
> My 2 cents,
> Lluis
>
> --
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
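The two hand-off options discussed in this thread can be sketched in plain NumPy. Everything here is hypothetical illustration, not the NEP's or the iterator's API: `merge_mask`, `split_mask`, `iter_values` and the `skip_missing` flag are made-up names, and NaN is assumed as the NA signal value for float data (which would also swallow any genuine NaN in the input).

```python
import numpy as np

NA = np.nan  # assumed NA signal value for float data (illustration only)


def merge_mask(data, mask):
    """Return a copy of `data` with masked entries overwritten by NA."""
    merged = data.astype(float, copy=True)
    merged[mask] = NA
    return merged


def split_mask(merged, original):
    """Translate a signal-value buffer back into (data, mask).

    Masked slots are refilled from `original`, because the signal value
    carries no trace of the value it replaced.
    """
    mask = np.isnan(merged)
    data = np.where(mask, original, merged)
    return data, mask


def iter_values(data, mask, skip_missing=True):
    """Yield (index, value, is_missing); skip missing entries by default."""
    for i, (value, missing) in enumerate(zip(data, mask)):
        if missing and skip_missing:
            continue
        yield i, value, bool(missing)


data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, False])

# "No" case: mask-unaware code gets a merged copy, and its result is
# translated back into data + mask afterwards.
merged = merge_mask(data, mask)
result = merged * 2  # stand-in for the 3rd party computation
restored, new_mask = split_mask(result, data)

# "Yes" case: the consumer iterates, by default over non-missing data only.
visible = list(iter_values(data, mask))
everything = list(iter_values(data, mask, skip_missing=False))
```

The merged copy is where the original value is lost (`merged[1]` is NaN), which is why the round trip needs the untouched `data` array to rebuild the masked slot.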