On Sat, Jun 25, 2011 at 3:51 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Sat, Jun 25, 2011 at 11:32 AM, Benjamin Root <ben.r...@ou.edu> wrote:
>> On Sat, Jun 25, 2011 at 12:05 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>> I guess that is a difference, but I'm trying to get at something more
>>> fundamental -- not just what operations are allowed, but what
>>> operations people *expect* to be allowed.
>>
>> That is quite a trickier problem.
>
> It can be. I think of it as the difference between design and coding.
> They overlap less than one might expect...
>
>>> Here's another possible difference -- in (1), intuitively, missingness
>>> is a property of the data, so the logical place to put information
>>> about whether you can expect missing values is in the dtype, and to
>>> enable missing values you need to make a new array with a new dtype.
>>> (If we use a mask-based implementation, then
>>> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
>>> to skip making a copy of the data -- I'm talking ONLY about the
>>> interface here, not whether missing data has a different storage
>>> format from non-missing data.)
>>>
>>> In (2), the whole point is to use different masks with the same data,
>>> so I'd argue masking should be a property of the array object rather
>>> than the dtype, and the interface should logically allow masks to be
>>> created, modified, and destroyed in place.
>>>
>>
>> I can agree with this distinction. However, if "missingness" is an
>> intrinsic property of the data, then shouldn't users be implementing their
>> own dtype tailored to the data they are using? In other words, how far does
>> the core of NumPy need to go to address this issue? And how far would be
>> "too much"?
>
> Yes, that's exactly my question: whether our goal is to implement
> missingness in numpy or not!
>
>>>
>>> They're both internally consistent, but I think we might have to make
>>> a decision and stick to it.
>>>
>>
>> Of course. I think that Mark is having a very inspired idea of giving the R
>> audience what they want (np.NA), while simultaneously making the use of
>> masked arrays even easier (which I can certainly appreciate).
>
> I don't know. I think we could build a really top-notch implementation
> of missingness. I also think we could build a really top-notch
> implementation of masking. But my suggestions for how to improve the
> current design are totally different depending on which of those is
> the goal, and neither the R audience (like me) nor the masked array
> audience (like you) seems really happy with the current design. And I
> don't know what the goal is -- maybe it's something else and the
> current design hits it perfectly? Maybe we want a top-notch
> implementation of *both* missingness and masking, and those should be
> two different things that can be combined, so that some of the
> unmasked values inside a masked array can be NA? I don't know.
>
>> I will put out a little disclaimer. I once had to use S+ for a class. To
>> be honest, it was the worst programming experience in my life. This
>> experience may be coloring my perception of R's approach to handling missing
>> data.
>
> There's a lot of things that R does wrong (not their fault; language
> design is an extremely difficult and specialized skill, that
> statisticians are not exactly trained in), but it did make a few
> excellent choices at the beginning. One was to steal the execution
> model from Scheme, which, uh, isn't really relevant here. The other
> was to steal the basic data types and standard library that the Bell
> Labs statisticians had pounded into shape over many years. I use
> Python now because using R for everything would drive me crazy, but
> despite its many flaws, it still does some things so well that it's
> become *the* language used for basically all statistical research. I'm
> only talking about stealing those things :-).
>
> -- Nathaniel
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
+1. Everyone knows R ain't perfect. I think it's an atrociously bad programming language, but it can be unbelievably good at statistics, as evidenced by its success. Brings to mind Andy Gelman's blog last fall:

http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

As someone in a statistics department, I've frequently been disheartened when I see how easy many statistical things are in R and how much more difficult they are in Python. This is partially the result of poor interfaces for statistical modeling, partially due to data structures (e.g. the integrated-ness of data.frame throughout R), and partially due to things like the handling of missing data, for which there's currently no equivalent.

- Wes