Re: [Numpy-discussion] Missing data again
Hi Chuck, I think I let my frustration get the better of me, and the message below is too confrontational. I apologize. I truly would like to understand where you're coming from on this, though, so I'll try to make this more productive. My summary of points that no one has disagreed with yet is here: https://github.com/njsmith/numpy/wiki/NA-discussion-status

Of course, this means that there's lots that's left out. Instead of getting into all those contentious details, I'll stick to just a few basic questions that might let us get at least a bit of common ground: 1) Do you disagree with anything that is stated there? 2) Do you feel like that document accurately summarises your basic idea of what this feature is supposed to do (I assume under the IGNORED heading)?

Thanks, -- Nathaniel

On Wed, Mar 7, 2012 at 11:10 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Maybe for float; for other things, no. And we have lots of other things.

It would be easier to discuss this if you'd, like, discuss :-(. If you know of some advantage that masks have over bitpatterns when it comes to missing data, can you please share it, instead of just asserting it? Not that I'm immune... I perhaps should have been more explicit myself: when I said "performance characteristics", let me clarify that I was thinking of both speed (for floats) and memory (for most-but-not-all things).

The performance is a strawman,

How many users need to speak up to say that this is a serious problem they have with the current implementation before you stop calling it a strawman? Because when Wes says that it's not going to fly for his stats/econometrics cases, and the neuroimaging folk like Gary and Matt say it's not going to fly for their use cases... surely just waving that away is a bit dismissive? I'm not saying that we *have* to implement bitpatterns because performance is *the most important feature* -- I'm just saying, well, what I said. For *missing data* use cases, bitpatterns have better performance characteristics than masks. If we decide that these use cases are important, then we should take this into account and weigh it against other considerations. Maybe what you think is that these use cases shouldn't be the focus of this feature, and it should focus on the ignored use cases instead? That would be a legitimate argument... but if that's what you want to say, say it, don't just dismiss your users!

and it *isn't* easier to implement.

If I thought bitpatterns would be easier to implement, I would have said so... What I said was that they're not harder. You have some extra complexity, mostly in casting, and some reduced complexity -- no need to allocate and manipulate the mask. (E.g., simple same-type assignments and slicing require special casing for masks, but not for bitpatterns.) In many places the complexity is identical -- printing routines need to check for either special bitpatterns or masked values, whatever. Ufunc loops need to either find the appropriate part of the mask, or create a temporary mask buffer by calling a dtype func, whatever. On net they seem about equivalent, complexity-wise.
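To make that comparison concrete, here is a toy model of the two inner-loop strategies. This is a pure-Python sketch of what are really C ufunc loops, and the NA bitpattern value is purely illustrative:

import numpy as np

# Illustrative NA bitpattern: a quiet NaN with a distinctive payload.
NA_BITS = np.uint64(0x7FF80000000007B6)

def add_bitpattern(x, y, out):
    # Bitpattern loop: the NA check inspects the float values themselves,
    # so no side buffer has to be allocated or kept in sync.
    for i in range(len(x)):
        if x[i].view(np.uint64) == NA_BITS or y[i].view(np.uint64) == NA_BITS:
            out[i] = NA_BITS.view(np.float64)
        else:
            out[i] = x[i] + y[i]

def add_masked(x, xmask, y, ymask, out, outmask):
    # Masked loop: roughly the same branch count, but three extra mask
    # arrays must be allocated and threaded through the call.
    for i in range(len(x)):
        outmask[i] = xmask[i] and ymask[i]  # True means "value is present"
        if outmask[i]:
            out[i] = x[i] + y[i]

On this toy reading the two come out roughly even, which is the point being argued.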
...I assume you disagree with this analysis, since I've said it before, wrote up a sketch for how the implementation would work at the C level, etc., and you continue to claim that simplicity is a compelling advantage for the masked approach. But I still don't know why you think that :-(.

Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice.

True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available.

One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now?
Re: [Numpy-discussion] Missing data again
Hi, Thank you very much for your insights!

On 06/03/2012 21:59, Nathaniel Smith wrote: Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited.

I also suspected R to be less powerful in terms of types. However, I think the fact that "it's very simple when it works" is important to take into account. At the end of the day, when using all the fanciness, it is not only about "can I have some NAs in my array?" but also "how *easily* can I have some NAs in my array?". It's about balancing the how easy and the how powerful. This ease of use is the reason for my concern about having separate types nafloatNN and floatNN. Of course, I won't argue that not breaking everything is even more important!!

Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

Best, Pierre
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 4:35 PM, Pierre Haessig pierre.haes...@crans.org wrote: Hi, Thank you very much for your insights!

On 06/03/2012 21:59, Nathaniel Smith wrote: Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited.

I also suspected R to be less powerful in terms of types. However, I think the fact that "it's very simple when it works" is important to take into account. At the end of the day, when using all the fanciness, it is not only about "can I have some NAs in my array?" but also "how *easily* can I have some NAs in my array?". It's about balancing the how easy and the how powerful. This ease of use is the reason for my concern about having separate types nafloatNN and floatNN. Of course, I won't argue that not breaking everything is even more important!!

It's a good point, I just don't see how we can really tell what the trade-offs are at this point. You should bring this up again once more of the big picture stuff is hammered out.

Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

All modern floating point formats can represent NaNs with payloads, so in principle there's no difficulty in supporting NA the same way for all of them. If you're using float16 because you want to offload computation to a GPU, then I would test carefully before trusting the GPU to handle NaNs correctly, and there may need to be a bit of care to make sure that casts between these types properly map NAs to NAs, but generally it should be fine. -- Nathaniel
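As a quick illustration of the point that all modern formats can carry NaN payloads (this shows representation only -- whether a given GPU kernel *propagates* the payload is exactly what needs testing):

import numpy as np

# float32: quiet-NaN bits are 0x7FC00000; spare mantissa bits hold a payload.
na32 = np.uint32(0x7FC00000 | 1974).view(np.float32)
print(np.isnan(na32))             # True -- still an ordinary NaN
print(hex(na32.view(np.uint32)))  # 0x7fc007b6 -- the payload round-trips

# float16 has only 10 mantissa bits, but a payload still fits.
na16 = np.uint16(0x7E00 | 0x1).view(np.float16)
print(np.isnan(na16))             # True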
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote: Hi, Thank you very much for your insights!

On 06/03/2012 21:59, Nathaniel Smith wrote: Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited.

I also suspected R to be less powerful in terms of types. However, I think the fact that "it's very simple when it works" is important to take into account. At the end of the day, when using all the fanciness, it is not only about "can I have some NAs in my array?" but also "how *easily* can I have some NAs in my array?". It's about balancing the how easy and the how powerful. This ease of use is the reason for my concern about having separate types nafloatNN and floatNN. Of course, I won't argue that not breaking everything is even more important!!

Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

I'm in agreement here, and that was a major consideration in making a 'masked' implementation first. Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice. One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing.

Chuck
Re: [Numpy-discussion] Missing data again
Charles R Harris writes: [...] One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data.

AFAIR, the inability to directly access a mask attribute was intentional, to make bit-patterns and masks indistinguishable from the POV of the array user. What's the workflow that leads you to un-ignore specific elements?

So maybe two types of masks (different payloads), or an additional flag could be helpful.

Do you mean different NA values? If that's the case, I think it was taken into account when implementing the current mechanisms (and was also mentioned in the NEP), so that it could be supported by both bit-patterns and masks (as one of the main design points was to make them indistinguishable in the common case). I think the name was parametrized dtypes.

The process of assigning masks could also be made a bit easier than using fancy indexing.

I don't get what you mean here, sorry. Do you mean here that this is too cumbersome to use?

a[a > 5] = np.NA

(obviously oversimplified example where everything looks sufficiently simple :))

Lluis -- And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer. -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
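For what it's worth, numpy.ma already spells that example almost identically, and non-destructively:

import numpy as np

a = np.ma.array([2.0, 4.0, 6.0, 8.0])
a[a > 5] = np.ma.masked  # np.ma's existing spelling of a[a > 5] = np.NA
print(a)                 # [2.0 4.0 -- --]
print(a.data)            # [ 2. 4. 6. 8.] -- the values survive under the mask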
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 11:21 AM, Lluís xscr...@gmx.net wrote: Charles R Harris writes: [...] One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data.

AFAIR, the inability to directly access a mask attribute was intentional, to make bit-patterns and masks indistinguishable from the POV of the array user. What's the workflow that leads you to un-ignore specific elements?

Because they are not 'unknown', just (temporarily) 'ignored'. This might be the case if you are experimenting with what happens if certain data is left out of a fit. The current implementation tries to handle both these cases, and can do so; I would just like the 'ignored' use to be more convenient than it is.

So maybe two types of masks (different payloads), or an additional flag could be helpful.

Do you mean different NA values? If that's the case, I think it was taken into account when implementing the current mechanisms (and was also mentioned in the NEP), so that it could be supported by both bit-patterns and masks (as one of the main design points was to make them indistinguishable in the common case).

No, the mask as currently implemented is eight bits and can be extended to handle different mask values, aka payloads.

I think the name was parametrized dtypes.

They don't interest me in the least. But that is a whole different area of discussion.

The process of assigning masks could also be made a bit easier than using fancy indexing.

I don't get what you mean here, sorry.

Suppose I receive a data set, say an HDF file, that also includes a mask. I'd like to load the data and apply the mask directly without doing something like data[mask] = np.NA

Do you mean here that this is too cumbersome to use? a[a > 5] = np.NA (obviously oversimplified example where everything looks sufficiently simple :))

Mostly speed and memory.

Chuck
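A sketch of that workflow, with hypothetical file and dataset names (h5py is assumed here; the point is that numpy.ma can attach a distributed mask directly, where the NA API currently requires the fancy-indexing assignment):

import numpy as np
import h5py  # assumed; any HDF5 reader would look much the same

# Hypothetical layout: a data array plus a companion mask array.
with h5py.File("observations.h5", "r") as f:
    data = f["data"][...]
    mask = f["mask"][...].astype(bool)

# The inconvenient route touches every flagged element via fancy indexing:
#   data[mask] = np.NA
# versus the direct route numpy.ma offers today:
marr = np.ma.array(data, mask=mask)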
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote: Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

I'm in agreement here, and that was a major consideration in making a 'masked' implementation first.

When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice.

True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available.

One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now? -- Nathaniel
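numpy.ma is the closest existing approximation of that design, for anyone who wants to try the feel of an ordinary, assignable mask attribute:

import numpy as np

a = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, True])
print(a.sum())         # 3.0 -- masked entries are ignored, not destroyed

a.mask[2] = False      # un-ignore one element without assigning new data
print(a.sum())         # 6.0 -- the original 3.0 was there all along

a.mask = (a.data > 2)  # replace the whole mask with any broadcastable array
print(a)               # [1.0 2.0 -- --]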
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote: Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

I'm in agreement here, and that was a major consideration in making a 'masked' implementation first.

When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Maybe for float; for other things, no. And we have lots of other things. The performance is a strawman, and it *isn't* easier to implement.

Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice.

True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available.

One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now?

Come off it, Nathaniel, the problem is minor and fixable. The intent of the initial implementation was to discover such things. These things are less accessible with the current API *precisely* because of the feedback from R users. It didn't start that way. We now have something to evolve into what we want. That is a heck of a lot more useful than endless discussion.

Chuck
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote: Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

I'm in agreement here, and that was a major consideration in making a 'masked' implementation first.

When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Not true. Bitpatterns inherently destroy the data, while masks do not. For matplotlib, we can not use bitpatterns because it could over-write user data (or we have to copy the data). I would imagine other extension writers would have similar issues when they need to play around with input data in a safe manner. Also, I doubt that the performance characteristics for strings and integers are the same as they are for masks.

Ben Root
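A minimal illustration of that distinction, using NaN as a stand-in for an NA bitpattern:

import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Bitpattern-style: writing the special value destroys the original datum.
y = x.copy()   # an extension like matplotlib needs this copy to stay safe
y[1] = np.nan  # the 2.0 in y is gone for good
print(y)       # [ 1. nan  3.]

# Mask-style: the datum survives underneath the mask.
z = np.ma.array(x, mask=[False, True, False])
print(z)          # [1.0 -- 3.0]
print(z.data[1])  # 2.0 -- still recoverable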
Re: [Numpy-discussion] Missing data again
Hi,

On Wed, Mar 7, 2012 at 11:37 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote: Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

I'm in agreement here, and that was a major consideration in making a 'masked' implementation first.

When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Maybe for float; for other things, no. And we have lots of other things. The performance is a strawman, and it *isn't* easier to implement.

Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice.

True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available.

One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now?

Come off it, Nathaniel, the problem is minor and fixable. The intent of the initial implementation was to discover such things. These things are less accessible with the current API *precisely* because of the feedback from R users. It didn't start that way. We now have something to evolve into what we want. That is a heck of a lot more useful than endless discussion.

The endless discussion is for the following reason: the discussion was never adequately resolved. It was never adequately resolved because there was not enough work done to understand the various arguments. In particular, you've several times said things that indicate to me, as to Nathaniel, that you either have not read or have not understood the points that Nathaniel was making. Travis' recent email - to me - also indicates that there is still a genuine problem here that has not been adequately explored. There is no future in trying to stop discussion, and trying to do so will only prolong it and make it less useful. It will make the discussion - endless. If you want to help - read the alterNEP, respond to it directly, and further the discussion by engaged debate.
Best, Matthew
Re: [Numpy-discussion] Missing data again
On 03/07/2012 09:26 AM, Nathaniel Smith wrote: On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org wrote: Coming back to Travis's proposition that "bit-pattern approaches to missing data (*at least* for float64 and int32) need to be implemented", I wonder how much extra work it would take to go from nafloat64 to nafloat32/16. Is there hardware support for NaN payloads with these smaller floats? If not, or if it is too complicated, I feel it is acceptable to say "it's too complicated" and fall back to masks. One may have to choose between fancy types and fancy NAs...

I'm in agreement here, and that was a major consideration in making a 'masked' implementation first.

When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice.

True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available.

One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful. The process of assigning masks could also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now?

In other words, good low-level support for numpy.ma functionality? With a migration path so that a separate numpy.ma might wither away? Yes, there is interest; this is exactly what I think is needed for my own style of applications (which I think are common at least in geoscience), and for matplotlib. The question is how to achieve it as simply and cleanly as possible while also satisfying the needs of the R users, and while making it easy for matplotlib, for example, to handle *any* reasonable input: ma, other masking, nan, or NA-bitpattern.

It may be that a rather pragmatic approach to implementation will prove better than a highly idealized set of data models. Or, it may be that a dual approach is best, in which the 'flag value' missing data implementation is tightly bound to the R model and the mask implementation is explicitly designed for the numpy.ma model. In any case, a reasonable level of agreement on the goals is needed. I presume Travis's involvement will facilitate a clarification of the goals and of the implementation; and I expect that much of Mark's work will end up serving well, even if much needs to be added and the API evolves considerably.

Eric
Re: [Numpy-discussion] Missing data again
Hi,

On 07/03/2012 20:57, Eric Firing wrote: In other words, good low-level support for numpy.ma functionality?

Coming back to *existing* ma support, I was just wondering whether it was now possible to np.save a masked array. (I'm using numpy 1.5.) In the end, this is the most annoying problem I have with the existing ma module, which otherwise is pretty useful to me. I'm happy not to need to process 100% of my data, though.

Best, Pierre
Re: [Numpy-discussion] Missing data again
On 03/07/2012 11:15 AM, Pierre Haessig wrote: Hi, On 07/03/2012 20:57, Eric Firing wrote: In other words, good low-level support for numpy.ma functionality?

Coming back to *existing* ma support, I was just wondering whether it was now possible to np.save a masked array. (I'm using numpy 1.5.)

No, not with the mask preserved. This is one of the improvements I am hoping for with the upcoming missing data work.

Eric

In the end, this is the most annoying problem I have with the existing ma module, which otherwise is pretty useful to me. I'm happy not to need to process 100% of my data, though.

Best, Pierre
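Until that lands, a common workaround is to save the data and the mask side by side:

import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.save(a) would silently drop the mask, so store both pieces explicitly.
np.savez("a_masked.npz", data=a.filled(np.nan), mask=np.ma.getmaskarray(a))

f = np.load("a_masked.npz")
b = np.ma.array(f["data"], mask=f["mask"])
print(b)  # [1.0 -- 3.0]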
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote: When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Maybe for float; for other things, no. And we have lots of other things.

It would be easier to discuss this if you'd, like, discuss :-(. If you know of some advantage that masks have over bitpatterns when it comes to missing data, can you please share it, instead of just asserting it? Not that I'm immune... I perhaps should have been more explicit myself: when I said "performance characteristics", let me clarify that I was thinking of both speed (for floats) and memory (for most-but-not-all things).

The performance is a strawman,

How many users need to speak up to say that this is a serious problem they have with the current implementation before you stop calling it a strawman? Because when Wes says that it's not going to fly for his stats/econometrics cases, and the neuroimaging folk like Gary and Matt say it's not going to fly for their use cases... surely just waving that away is a bit dismissive? I'm not saying that we *have* to implement bitpatterns because performance is *the most important feature* -- I'm just saying, well, what I said. For *missing data* use cases, bitpatterns have better performance characteristics than masks. If we decide that these use cases are important, then we should take this into account and weigh it against other considerations. Maybe what you think is that these use cases shouldn't be the focus of this feature, and it should focus on the ignored use cases instead? That would be a legitimate argument... but if that's what you want to say, say it, don't just dismiss your users!

and it *isn't* easier to implement.

If I thought bitpatterns would be easier to implement, I would have said so... What I said was that they're not harder. You have some extra complexity, mostly in casting, and some reduced complexity -- no need to allocate and manipulate the mask. (E.g., simple same-type assignments and slicing require special casing for masks, but not for bitpatterns.) In many places the complexity is identical -- printing routines need to check for either special bitpatterns or masked values, whatever. Ufunc loops need to either find the appropriate part of the mask, or create a temporary mask buffer by calling a dtype func, whatever. On net they seem about equivalent, complexity-wise.

...I assume you disagree with this analysis, since I've said it before, wrote up a sketch for how the implementation would work at the C level, etc., and you continue to claim that simplicity is a compelling advantage for the masked approach. But I still don't know why you think that :-(.

Also, different folks adopt different values for 'missing' data, and distributing one or several masks along with the data is another common practice.

True, but not really relevant to the current debate, because you have to handle such issues as part of your general data import workflow anyway, and none of these is any more complicated no matter which implementations are available.

One inconvenience I have run into with the current API is that it should be easier to clear the mask from an ignored value without taking a new view or assigning known data. So maybe two types of masks (different payloads), or an additional flag, could be helpful.
The process of assigning masks could also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP design for masks -- making all this stuff easy for people (like you, apparently?) that want support for ignored values, separately from missing data, and want a nice clean API for it. Basically having a separate .mask attribute which was an ordinary, assignable array broadcastable to the attached array's shape. Nobody seemed interested in talking about it much then but maybe there's interest now?

Come off it, Nathaniel, the problem is minor and fixable. The intent of the initial implementation was to discover such things.

Implementation can be wonderful, I absolutely agree. But you understand that I'd be more impressed by this example if your discovery weren't something I had been arguing for since before the implementation began :-).

These things are less accessible with the current API *precisely* because of the feedback from R users. It didn't start that way. We now have something to evolve into what we want. That is a heck of a lot more useful than endless discussion.

No, you are still missing the point completely! There is no "what *we* want", because what you want is different than what I want. The masking stuff in the alterNEP was an attempt to give people like you who wanted ignored support what they wanted, and the bitpattern stuff was to do the same for the people who want missing data support.
Re: [Numpy-discussion] Missing data again
On Wed, Mar 7, 2012 at 7:39 PM, Benjamin Root ben.r...@ou.edu wrote: On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote: When it comes to missing data, bitpatterns can do everything that masks can do, are no more complicated to implement, and have better performance characteristics.

Not true. Bitpatterns inherently destroy the data, while masks do not.

Yes, that's why I only wrote that this is true for missing data, not in general :-). If you have data that is being destroyed, then that's not missing data, by definition. We don't have consensus yet on whether that's the use case we are aiming for, but it's the one that Pierre was worrying about.

For matplotlib, we can not use bitpatterns because it could over-write user data (or we have to copy the data). I would imagine other extension writers would have similar issues when they need to play around with input data in a safe manner.

Right. You clearly need some sort of masking, either an explicit mask array that you keep somewhere, or one that gets attached to the underlying ndarray in some non-destructive way.

Also, I doubt that the performance characteristics for strings and integers are the same as they are for masks.

Not sure what you mean by this, but I'd be happy to hear more. -- Nathaniel
Re: [Numpy-discussion] Missing data again
Hi Mark, I went through the NA NEP a few days ago, but only too quickly, so my question is probably a rather dumb one. It's about the usability of bitpattern-based NAs, based on your recent post.

On 03/03/2012 22:46, Mark Wiebe wrote: Also, here's a thought for the usability of NA-float64. As much as global state is a bad idea, something which determines whether implicit float dtypes are NA-float64 or float64 could help. In IPython, pylab mode would default to float64, and statlab or pystat would default to NA-float64. One way to write this might be:

>>> np.set_default_float(np.nafloat64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=nafloat64)
>>> np.set_default_float(np.float64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=float64)

Q: Is it an *absolute* necessity to have two separate dtypes nafloatNN and floatNN to enable NA bitpattern storage? From a potential user perspective, I feel it would be nice to have NA and non-NA cases look as similar as possible. Your code example is particularly striking: two different dtypes to store (from a user perspective) the exact same content! If this *could* be avoided, it would be great...

I don't know how the NA machinery works in R. Does it work with a kind of nafloat64 all the time, or is there some type inference mechanics involved in choosing the appropriate type?

Best, Pierre
Re: [Numpy-discussion] Missing data again
Hi Pierre,

On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.org wrote: Hi Mark, I went through the NA NEP a few days ago, but only too quickly, so my question is probably a rather dumb one. It's about the usability of bitpattern-based NAs, based on your recent post.

On 03/03/2012 22:46, Mark Wiebe wrote: Also, here's a thought for the usability of NA-float64. As much as global state is a bad idea, something which determines whether implicit float dtypes are NA-float64 or float64 could help. In IPython, pylab mode would default to float64, and statlab or pystat would default to NA-float64. One way to write this might be:

>>> np.set_default_float(np.nafloat64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=nafloat64)
>>> np.set_default_float(np.float64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=float64)

Q: Is it an *absolute* necessity to have two separate dtypes nafloatNN and floatNN to enable NA bitpattern storage? From a potential user perspective, I feel it would be nice to have NA and non-NA cases look as similar as possible. Your code example is particularly striking: two different dtypes to store (from a user perspective) the exact same content! If this *could* be avoided, it would be great...

The biggest reason to keep the two types separate is performance. The straight float dtypes map directly to hardware floating-point operations, which can be very fast. The NA-float dtypes have to use additional logic to handle the NA values correctly. NA is treated as a particular NaN, and if the hardware float operations were used directly, NA would turn into NaN. This additional logic usually means more branches, so is slower.

One possibility we could consider is to automatically convert an array's dtype from float64 to nafloat64 the first time an NA is assigned. This would have good performance when there are no NAs, but would transparently switch on NA support when it's needed.

I don't know how the NA machinery works in R. Does it work with a kind of nafloat64 all the time, or is there some type inference mechanics involved in choosing the appropriate type?

My understanding of R is that it works with the nafloat64 for all its operations, yes.

Cheers, Mark
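A sketch of how that auto-conversion might look from the user's side -- everything here is hypothetical, since np.nafloat64 and np.NA are proposed names rather than released numpy API:

import numpy as np

def na_setitem(arr, index, value):
    # Hypothetical helper: promote float64 -> nafloat64 on the first NA,
    # so clean arrays keep the fast hardware-float path until NA appears.
    if value is np.NA and arr.dtype == np.float64:
        arr = arr.astype(np.nafloat64)  # proposed NA-aware dtype
    arr[index] = value
    return arr

a = np.array([1.0, 2.0, 3.0])  # plain float64, full speed
a = na_setitem(a, 1, np.NA)    # transparently becomes nafloat64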
Re: [Numpy-discussion] Missing data again
On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote: Hi all,

Hi Travis, Thanks for bringing this back up. Have you looked at the summary from the last thread? https://github.com/njsmith/numpy/wiki/NA-discussion-status

The goal was to try and at least work out what points we all *could* agree on, to have some common footing for further discussion. I won't copy the whole thing here, but I'd summarize the state as:

-- It's pretty clear that there are two fairly different conceptual models/use cases in play here. For one of them (R-style missing data cases) it's pretty clear what the desired semantics would be. For the other (temporary ignored values) there's still substantive disagreement.

-- We *haven't* yet established what we want numpy to actually support.

IMHO the critical next step is this latter one -- maybe we want to fully support both use cases. Maybe it's really only one of them that's worth trying to support in the numpy core right now. Maybe it's just one of them, but it's worth doing so thoroughly that it should have multiple implementations. Or whatever. I fear that if we don't talk about these big picture questions and just wade directly back into round-and-round arguments about API details then we'll never get anywhere. [...]

Because it is slated to go into release 1.7, we need to re-visit the masked array discussion again. The NEP process is the appropriate one and I'm glad we are taking that route for these discussions. My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code). It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...). We can represent different parts of what is fortunately a very large user-base of NumPy users.

First of all, I want to be clear that I think there is much great work that has been done in the current missing data code. There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet. However, I don't think the API presented to the numpy user presently is the correct one for NumPy 1.X. A few particulars:

* the reduction operations need to default to skipna --- this is the most common use case, which has been reinforced again to me today by a new user to Python who is using masked arrays presently

This is one of the points where the two conceptual models disagree (see also Skipper's point down-thread). If you have missing data, then propagation has to be the default -- the sum of 1, 2, and I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there but you've asked numpy to temporarily ignore it, then, well, duh, of course it should ignore it.

* the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python)

This is also a point where the two conceptual models disagree. Actually this is one of the original arguments we made against the NEP design -- that if you want missing data, then having a mask at all is counterproductive, and if you are ignoring data, then of course it should be easy to manipulate the ignore mask. The rationale for the current design is to compromise between these two approaches -- there is a mask, but it's hidden behind a curtain. Mostly. (This may be a compromise in the Solomonic sense.)
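To pin the propagate-versus-skip disagreement down in code (NEP-style spelling; np.NA, maskna=, and skipna= existed only in the experimental development branch and are shown here as proposals, not released API):

import numpy as np

a = np.array([1.0, 2.0, np.NA], maskna=True)  # experimental-branch spelling

a.sum()             # NA  -- missing-data semantics: the unknown poisons it
a.sum(skipna=True)  # 3.0 -- ignored-data semantics: reduce the known values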
* bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented.

* there should be some way when using masks (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks...

I don't understand what this means.

I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. None-the-less, I'm an *applied* mathematician and am ultimately motivated by applications. I will get a hold of the NEP and spend some time with it to discuss some of this in that document. This will take several weeks (as PyCon is next week and I have a tutorial I'm giving there). For now, I do not think 1.7 can be released unless the masked array is labeled *experimental*.

In project management terms, I see three options: 1) Put a big warning label on the functionality and leave it for now (If this option is given, np.asarray returns a masked array. NOTE: IN THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY WEASELS. NO GUARANTEES.) 2) Move the code back out of mainline and into a branch until there's consensus. 3) Hold up the release until this is all sorted.
Re: [Numpy-discussion] Missing data again
On Tue, Mar 6, 2012 at 4:38 PM, Mark Wiebe mwwi...@gmail.com wrote: On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.org wrote: From a potential user perspective, I feel it would be nice to have NA and non-NA cases look as similar as possible. Your code example is particularly striking: two different dtypes to store (from a user perspective) the exact same content! If this *could* be avoided, it would be great...

The biggest reason to keep the two types separate is performance. The straight float dtypes map directly to hardware floating-point operations, which can be very fast. The NA-float dtypes have to use additional logic to handle the NA values correctly. NA is treated as a particular NaN, and if the hardware float operations were used directly, NA would turn into NaN. This additional logic usually means more branches, so is slower.

Actually, no -- hardware float operations preserve NA-as-NaN. You might well need to be careful around more exotic code like optimized BLAS kernels, but all the basic ufuncs should Just Work at full speed. Demo:

>>> def hexify(x): return hex(np.float64(x).view(np.int64))
>>> hexify(np.nan)
'0x7ff8000000000000L'
>>> # IIRC this is R's NA bitpattern (presumably 1974 is someone's birthday)
>>> NA = np.int64(0x7ff8000000000000 + 1974).view(np.float64)
>>> # It is an NaN...
>>> NA
nan
>>> # But it has a distinct bitpattern:
>>> hexify(NA)
'0x7ff80000000007b6L'
>>> # Like any NaN, it propagates through floating point operations:
>>> NA + 3
nan
>>> # But, critically, so does the bitpattern; ordinary Python + is
>>> # returning NA on this operation:
>>> hexify(NA + 3)
'0x7ff80000000007b6L'

This is how R does it, which is more evidence that this actually works on real hardware.

There is one place where it fails. In a binary operation with *two* NaN values, there's an ambiguity about which payload should be returned. IEEE 754 recommends just returning the first one. This means that NA + NaN = NA, NaN + NA = NaN. This is ugly, but it's an obscure case that nobody cares about, so it's probably worth it for the speed gain. (In fact, if you type those two expressions at the R prompt, then that's what you get, and I can't find any reference to anyone even noticing this.)

I don't know how the NA machinery works in R. Does it work with a kind of nafloat64 all the time, or is there some type inference mechanics involved in choosing the appropriate type?

My understanding of R is that it works with the nafloat64 for all its operations, yes.

Right -- R has a very impoverished type system as compared to numpy. There's basically four types: numeric (meaning double precision float), integer, logical (boolean), and character (string). And in practice the integer type is essentially unused, because R parses numbers like 1 as being floating point, not integer; the only way to get an integer value is to explicitly cast to it. Each of these types has a specific bit-pattern set aside for representing NA. And... that's it. It's very simple when it works, but also very limited.

I'm still skeptical that we could make the floating point types NA-aware by default -- until we have an implementation in hand, I'm nervous there'd be some corner case that broke everything. (Maybe ufuncs are fine but np.dot has an unavoidable overhead, or maybe it would mess up casting from float types to non-NA-aware types, etc.) But who knows. Probably not something we can really make a meaningful decision about yet. -- Nathaniel
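The two-NaN corner case described above can be checked with the same helpers; the outputs shown assume hardware that follows the IEEE 754 return-the-first-payload recommendation, as x86 does:

>>> hexify(NA + np.nan)  # NA's payload survives: NA + NaN = NA
'0x7ff80000000007b6L'
>>> hexify(np.nan + NA)  # the plain NaN's payload wins: NaN + NA = NaN
'0x7ff8000000000000L'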
Re: [Numpy-discussion] Missing data again
On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote: On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote: Hi all,

Hi Travis, Thanks for bringing this back up. Have you looked at the summary from the last thread? https://github.com/njsmith/numpy/wiki/NA-discussion-status

Re-reading that summary and the main documents and threads linked from it, I could find either examples of statistical software that treats missing and ignored data explicitly separately, or links to relevant literature. Those would probably help the discussion a lot.

The goal was to try and at least work out what points we all *could* agree on, to have some common footing for further discussion. I won't copy the whole thing here, but I'd summarize the state as:

-- It's pretty clear that there are two fairly different conceptual models/use cases in play here. For one of them (R-style missing data cases) it's pretty clear what the desired semantics would be. For the other (temporary ignored values) there's still substantive disagreement.

-- We *haven't* yet established what we want numpy to actually support.

IMHO the critical next step is this latter one -- maybe we want to fully support both use cases. Maybe it's really only one of them that's worth trying to support in the numpy core right now. Maybe it's just one of them, but it's worth doing so thoroughly that it should have multiple implementations. Or whatever. I fear that if we don't talk about these big picture questions and just wade directly back into round-and-round arguments about API details then we'll never get anywhere. [...]

Because it is slated to go into release 1.7, we need to re-visit the masked array discussion again. The NEP process is the appropriate one and I'm glad we are taking that route for these discussions. My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code). It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...). We can represent different parts of what is fortunately a very large user-base of NumPy users.

First of all, I want to be clear that I think there is much great work that has been done in the current missing data code. There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet. However, I don't think the API presented to the numpy user presently is the correct one for NumPy 1.X. A few particulars:

* the reduction operations need to default to skipna --- this is the most common use case, which has been reinforced again to me today by a new user to Python who is using masked arrays presently

This is one of the points where the two conceptual models disagree (see also Skipper's point down-thread). If you have missing data, then propagation has to be the default -- the sum of 1, 2, and I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there but you've asked numpy to temporarily ignore it, then, well, duh, of course it should ignore it.

* the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python)

This is also a point where the two conceptual models disagree.
Actually this is one of the original arguments we made against the NEP design -- that if you want missing data, then having a mask at all is counterproductive, and if you are ignoring data, then of course it should be easy to manipulate the ignore mask. The rationale for the current design is to compromise between these two approaches -- there is a mask, but it's hidden behind a curtain. Mostly. (This may be a compromise in the Solomonic sense.)

* bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented.

* there should be some way when using masks (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks...

I don't understand what this means.

I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. None-the-less, I'm an *applied* mathematician and am ultimately motivated by applications. I will get a hold of the NEP and spend some time with it to discuss some of this in that document. This will take several weeks (as PyCon is next week and I have a tutorial I'm giving there).
Re: [Numpy-discussion] Missing data again
On Tue, Mar 6, 2012 at 9:14 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote: On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote: On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote: Hi all,

Hi Travis, Thanks for bringing this back up. Have you looked at the summary from the last thread? https://github.com/njsmith/numpy/wiki/NA-discussion-status

Re-reading that summary and the main documents and threads linked from it, I could find either examples of statistical software that treats missing and ignored data explicitly separately, or links to relevant literature. Those would probably help the discussion a lot.

(I think you mean couldn't find?) I'm not aware of any software that supports the IGNORED concept at all, whether in combination with missing data or not. np.ma is probably the closest example. I think we'd be breaking new ground there. This is also probably why it is less clear how it should work :-).

IIUC, the basic reason that people want IGNORED in the core is that it provides convenience and syntactic sugar for efficient in-place operation on subsets of large arrays. So there are actually two parts there -- the efficient operation, and the convenience/syntactic sugar. The key feature for efficient operation is the where= feature, which is not controversial at all. So, there's an argument that for now we should focus on where=, give people some time to work with it, and then use that experience to decide what kind of convenience/sugar would be useful, if any. But, that's just my own idea; I definitely can't claim any consensus on it.

In project management terms, I see three options: 1) Put a big warning label on the functionality and leave it for now (If this option is given, np.asarray returns a masked array. NOTE: IN THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY WEASELS. NO GUARANTEES.)

I've opened http://projects.scipy.org/numpy/ticket/2072 for that.

Cool, thanks.

Assuming we stick with this option, I'd appreciate it if you could check in the first beta that comes out whether or not the warnings are obvious enough and in all the right places. There probably won't be weasels though :)

Of course. I've added myself to the CC list. (Err, if the beta won't be for a bit, though, then please remind me if you remember? I'm juggling a lot of balls right now.)

2) Move the code back out of mainline and into a branch until there's consensus. 3) Hold up the release until this is all sorted.

I come from the project-management school that says you should always have a releasable mainline, keep unready code in branches, and never hold up the release for features, so (2) seems obvious to me.

While it may sound obvious, I hope you've understood why in practice it's not at all obvious and why you got such strong reactions to your proposal of taking out all that code. If not, just look at what happened with the numpy-refactor work.

Of course, and that's why I'm not pressing the point. These trade-offs might be worth talking about at some point -- there are reasons that basically all the major FOSS projects have moved towards time-based releases :-) -- but that'd be a huge discussion at a time when we already have more than enough of those on our plate...

But I seem to be very much in the minority on that[1], so oh well :-). I don't have any objection to (1), personally. (3) seems like a bad idea. Just my 2 pence.

Agreed that (3) is a bad idea. +1 for (1).
Ralf

Cheers, -- Nathaniel
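For concreteness on the where= feature mentioned above, the usage under discussion looks roughly like this. This is a sketch only, assuming ufunc where=/out= keywords that leave unselected elements of the output buffer untouched:

import numpy as np

a = np.arange(6, dtype=np.float64)
valid = np.array([True, False, True, True, False, True])

# Operate in place on a subset of a large array, without copying it:
# elements where valid is False are left exactly as they were.
np.multiply(a, 10.0, out=a, where=valid)
print(a)   # [ 0.  1. 20. 30.  4. 50.]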
Re: [Numpy-discussion] Missing data again
On Sat, Mar 3, 2012 at 1:30 PM, Travis Oliphant tra...@continuum.io wrote:

Hi all, I've been thinking a lot about the masked array implementation lately. I finally had the time to look hard at what has been done, and now am of the opinion that I do not think that 1.7 can be released with the current state of the masked array implementation *unless* it is clearly marked as experimental and may be changed in 1.8.

That was the intention.

I wish I had been able to be a bigger part of this conversation last year. But, that is why I took the steps I took to try and figure out another way to feed my family *and* stay involved in the NumPy community. I would love to stay involved in what is happening in the SciPy community, but I am more satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles, Stefan, and others are doing there right now, and don't have time to keep up with everything. Even though SciPy was the heart and soul of why I even got involved with Python for open source in the first place and took many years of my volunteer labor, I won't be able to spend significant time on SciPy code over the coming months. At some point, I really hope to be able to make contributions again to that code base. Time will tell whether or not my aspirations will be realized. It depends quite a bit on whether or not my kids have what they need from me (which right now is money and time).

NumPy, on the other hand, is not in a position where I can feel comfortable leaving my baby to others. I recognize and value the contributions from many people to make NumPy what it is today (e.g. code contributions, code rearrangement and standardization, build and install improvement, and most recently some architectural changes). But, I feel a personal responsibility for the code base, as I spent a great many months writing NumPy in the first place, and I've spent a great deal of time interacting with NumPy users and feel like I have at least some sense of their stories. Of course, I built on the shoulders of giants, and much of what is there is *because of* where the code was adapted from (it was not created de novo). Currently, there remains much that needs to be communicated, improved, and worked on, and I have specific opinions about what some changes and improvements should be, how they should be written, and how users should benefit as a result. It will take time to discuss all of this, and that's where I will spend my open-source time in the coming months.

In that vein: because it is slated to go into release 1.7, we need to revisit the masked array discussion again. The NEP process is the appropriate one, and I'm glad we are taking that route for these discussions. My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code). It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...). We can represent different parts of what is fortunately a very large user-base of NumPy users.

First of all, I want to be clear that I think there is much great work that has been done in the current missing data code. There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet. However, I don't think the API presented to the NumPy user presently is the correct one for NumPy 1.X.
A few particulars:

* the reduction operations need to default to skipna --- this is the most common use case, which has been reinforced again to me today by a new user to Python who is using masked arrays presently
* the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python)
* bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented.
* there should be some way when using masks (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks...

I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. Nonetheless, I'm an *applied* mathematician and am ultimately motivated by applications.
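To make the bit-pattern bullet concrete: the idea, modeled loosely on R (whose NA_real_ is a NaN carrying the payload 1954), is to reserve one specific NaN bit pattern as NA for float64, so that no separate mask array is needed. A rough, purely illustrative sketch follows; NA_BITS and isna are hypothetical names, not any proposed API:

import numpy as np

NA_BITS = np.uint64(0x7FF00000000007A2)  # a NaN bit pattern with payload 1954, as in R
NA = NA_BITS.view(np.float64)            # the float64 whose bits are NA_BITS

def isna(arr):
    # True only for the exact NA bit pattern; ordinary NaNs don't match.
    return arr.view(np.uint64) == NA_BITS

a = np.array([1.0, np.nan, 3.0, 4.0])
a[2] = NA                      # storing NA is just a float64 assignment
print(isna(a))                 # [False False  True False]
print(np.isnan(a))             # [False  True  True False] -- NA is one particular NaN

# Reductions could then skip NA by default (the first bullet), here done
# by hand with a boolean index:
print(a[~isna(a)].sum())       # nan -- the ordinary NaN at index 1 remains
print(a[~np.isnan(a)].sum())   # 5.0 -- dropping every NaN, NA included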
Re: [Numpy-discussion] Missing data again
Mind, Mark only had a few weeks to write code. I think the unfinished state is a direct function of that.

I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. Nonetheless, I'm an *applied* mathematician and am ultimately motivated by applications.

I think that would be Wes. I thought the current state wasn't that far away from what he wanted in the only post where he was somewhat explicit. I think it would be useful for him to sit down with Mark at some time and thrash things out, since I think there is some misunderstanding involved.

Actually it wasn't Wes. It was 3 other people. I'm already well aware of Wes's perspective and actually think his concerns have been handled already. Also, the person who showed me their use-case was a new user. But, your point about getting people together is well-taken. I also recognize the fact that there have been (and likely continue to be) misunderstandings on multiple fronts. Fortunately, many of us will be at PyCon later this week. We tried really hard to get Mark Wiebe here this weekend as well --- but he could only sacrifice a week away from his degree work to join us for PyCon. It would be great if you could come to PyCon as well. Perhaps we can apply to NumFOCUS for a travel grant to bring NumPy developers together with other interested people to finish the masked array design and implementation.

-Travis