On Sat, Mar 3, 2012 at 12:30 PM, Travis Oliphant <tra...@continuum.io> wrote:
> <snip>
>
> First of all, I want to be clear that I think there is much great work
> that has been done in the current missing data code. There are some nice
> features in the where clause of the ufunc and the machinery for the
> iterator that allows re-using ufunc loops that are not re-written to check
> for missing data. I'm sure there are other things as well that I'm not
> quite aware of yet. However, I don't think the API presented to the
> numpy user presently is the correct one for NumPy 1.X.
>

I thought I might chime in with some implementation-detail notes, as
while Travis has dug into the code, I'm still the person who knows it
best. A few particulars:

>
>      * the reduction operations need to default to "skipna" --- this is
> the most common use case which has been re-inforced again to me today by a
> new user to Python who is using masked arrays presently
>

This is a completely trivial change. I went with the default as I did
because it's what R, the primary inspiration for the NA design, does.
We'll have to be sure this is well-marked in the documentation about
"NumPy NA for R users".

>      * the mask needs to be visible to the user if they use that
> approach to missing data (people should be able to get a hold of the mask
> and work with it in Python)
>

This is relatively easy. Probably the way to do it is with an
ndarray.maskna property. It could be in 1.7 if we really push. For the
multi-NA future, I think the NPY_MASK dtype, currently an alias for
NPY_UBYTE, would need to become its own dtype with separate .exposed and
.payload attributes.

>      * bit-pattern approaches to missing data (at least for float64 and
> int32) need to be implemented.
>

I strongly wanted to do masks first, because of the greater generality
and because the bit-patterns would best be implemented sharing mask
implementation details. I still believe this was the correct choice, and
it set the stage for bit-patterns.
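To make the three particulars above concrete, here is a minimal sketch using only stand-ins that exist in released NumPy. This is an assumption-laden illustration, not the NA API under discussion: numpy.ma masked arrays stand in for mask-based missing data, and NaN in a float64 array stands in for a bit-pattern missing value. The variable names are made up for the example.

```python
import numpy as np

# Stand-ins only: this does NOT use the NA-mask API discussed in the thread.
# numpy.ma plays the role of mask-based missing data, and NaN in a float64
# array plays the role of a bit-pattern missing value.

values = np.array([1.0, 2.0, 3.0])

# Mask-based: True marks a missing element.
masked = np.ma.masked_array(values, mask=[False, True, False])

# Reductions on numpy.ma arrays skip masked elements by default ("skipna").
masked_sum = masked.sum()

# The mask is visible to the user as an ordinary boolean array.
mask = np.ma.getmaskarray(masked)

# Bit-pattern-based: the missing marker lives inside the data itself.
with_nan = np.array([1.0, np.nan, 3.0])

# A plain reduction propagates the missing value (the R-style default) ...
propagating_sum = with_nan.sum()

# ... while skipping has to be requested explicitly.
skipping_sum = np.nansum(with_nan)
```

The contrast between `propagating_sum` and `skipping_sum` is the same default-behaviour question raised for reductions: which of the two a bare `sum()` should give.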
It will be possible to make inner loops that specialize for the default
hard-coded bit-pattern dtypes. I paid very careful attention in the
design making sure high performance is possible without significant
rework. The immense scale of the required code changes meant I couldn't
actually implement high performance in the time frame.

The place I think this affects 1.7 the most is in the default choice for
what np.array([1.0, np.NA, 3.0]) and np.array([1, np.NA, 3]) mean. In
1.7, both mean an NA-masked array. In 1.8, I can see a strong case that
the first should mean an NA-dtype, and the second an NA-masked array.

Also, here's a thought for the usability of NA-float64. As much as global
state is a bad idea, something which determines whether implicit float
dtypes are NA-float64 or float64 could help. In IPython, "pylab" mode
would default to float64, and "statlab" or "pystat" would default to
NA-float64. One way to write this might be:

    >>> np.set_default_float(np.nafloat64)
    >>> np.array([1.0, 2.0, 3.0])
    array([ 1., 2., 3.], dtype=nafloat64)
    >>> np.set_default_float(np.float64)
    >>> np.array([1.0, 2.0, 3.0])
    array([ 1., 2., 3.], dtype=float64)

>      * there should be some way when using "masks" (even if it's hidden
> from most users) for missing data to separate the low-level ufunc operation
> from the operation
> on the masks...
>

This is completely trivial to implement. Maybe
ndarray.view(maskna='ignore') is a reasonable way to spell direct access
without a mask.

Cheers,
Mark

> I have heard from several users that they will *not use the missing data*
> in NumPy as currently implemented, and I can now see why. For better or
> for worse, my approach to software is generally very user-driven and very
> pragmatic. On the other hand, I'm also a mathematician and appreciate the
> cognitive compression that can come out of well-formed structure.
> None-the-less, I'm an *applied* mathematician and am ultimately motivated
> by applications.
>
> I will get a hold of the NEP and spend some time with it to discuss some
> of this in that document. This will take several weeks (as PyCon is next
> week and I have a tutorial I'm giving there). For now, I do not think
> 1.7 can be released unless the masked array is labeled *experimental*.
>
> Thanks,
>
> -Travis
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>