Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Scott Sinclair
On 11 May 2012 08:12, Fernando Perez  wrote:
> On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair
>  wrote:
>> Having thought about it, a page on the website isn't a bad idea. I've
>> added a note pointing to this discussion. The document now appears at
>> http://numpy.scipy.org/NA-overview.html
>
> Why not have a separate repo for neps/discussion docs?  That way,
> people can be added to the team as they need to edit them and removed
> when done, and it's separate from the main site itself.  The site can
> simply have a link to this set of documents, which can be built,
> tracked, separately and cleanly.  We have more or less that setup with
> ipython for the site and docs:
>
> - main site page that points to the doc builds:
> http://ipython.org/documentation.html
> - doc builds on a secondary site:
> http://ipython.org/ipython-doc/stable/index.html

That's pretty much how things already work. The documentation is in
the main source tree and built docs end up at http://docs.scipy.org.
NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but
don't get published outside of the source tree and there's no
"preferred" place for discussion documents.

> (assuming we'll have a nice website for numpy one day)

Ha ha ha ;-) Thanks for the thoughts and prodding.

Cheers,
Scott


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Fernando Perez
On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair
 wrote:
> Having thought about it, a page on the website isn't a bad idea. I've
> added a note pointing to this discussion. The document now appears at
> http://numpy.scipy.org/NA-overview.html

Why not have a separate repo for neps/discussion docs?  That way,
people can be added to the team as they need to edit them and removed
when done, and it's separate from the main site itself.  The site can
simply have a link to this set of documents, which can be built,
tracked, separately and cleanly.  We have more or less that setup with
ipython for the site and docs:

- main site page that points to the doc builds:
http://ipython.org/documentation.html
- doc builds on a secondary site:
http://ipython.org/ipython-doc/stable/index.html

This seems to me like the best way to separate the main web team
(assuming we'll have a nice website for numpy one day) from the team
that will edit documents of nep/discussion type.  I imagine the web
team will be fairly stable, whereas the team for these docs will have
people coming and going.

Just a thought...  As usual, crib anything you find useful from our setup.

Cheers,

f


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Scott Sinclair
On 11 May 2012 06:57, Travis Oliphant  wrote:
>
> On May 10, 2012, at 3:40 AM, Scott Sinclair wrote:
>
>> On 9 May 2012 18:46, Travis Oliphant  wrote:
>>> The document is available here:
>>>    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>>
>> This is orthogonal to the discussion, but I'm curious as to why this
>> discussion document has landed in the website repo?
>>
>> I suppose it's not a really big deal, but future uploads of the
>> website will now include a page at
>> http://numpy.scipy.org/NA-overview.html with the content of this
>> document. If that's desirable, I'll add a note at the top of the
>> overview referencing this discussion thread. If not it can be
>> relocated somewhere more desirable after this thread's discussion
> deadline expires.
>
> Yes, it can be relocated.   Can you suggest where it should go?  It was added 
> there so that Nathaniel and Mark could both edit it together, with Nathaniel 
> added to the web-team.
>
> It may not be a bad place for it, though.   At least for a while.

Having thought about it, a page on the website isn't a bad idea. I've
added a note pointing to this discussion. The document now appears at
http://numpy.scipy.org/NA-overview.html

Cheers,
Scott


Re: [Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Dag Sverre Seljebotn
On 05/11/2012 07:36 AM, Travis Oliphant wrote:
>>>
>>> I guess this mixture of Python-API and C-API is different from the way
>>> the API tries to protect incorrect access. From the Python API, it
>>> should let everything through, because it's for Python code to use. From
>>> the C API, it should default to not letting things through, because
>>> special NA-mask aware code needs to be written. I'm not sure if there is
>>> a reasonable approach here which works for everything.
>>
>> Does that mean you consider changing ob_type for masked arrays
>> unreasonable? They can still use the same object struct...
>>
>>>
>>> But in general, I will often be lazy and just do
>>>
>>> def f(np.ndarray arr):
>>>  c_func(np.PyArray_DATA(arr))
>>>
>>> It's an exception if you don't provide an array -- so who cares. (I
>>> guess the odds of somebody feeding a masked array to code like that,
>>> which doesn't try to be friendly, is relatively smaller though.)
>>>
>>>
>>> This code would already fail with non-contiguous strides or byte-swapped
>>> data, so the additional NA mask case seems to fit in an already-failing
>>> category.
>>
>> Honestly! I hope you didn't think I provided a full-fledged example?
>> Perhaps you'd like to point out to me that "c_func" is a bad name for a
>> function as well?
>>
>> One would of course check that things are contiguous (or pass on the
>> strides), check the dtype and dispatch to different C functions in each
>> case, etc.
>>
>> But that isn't the point. Scientific code most of the time does fall in
>> the "already-failing" category. That doesn't mean it doesn't count.
>> Let's focus on the number of code lines written and developer hours that
>> will be spent cleaning up the mess -- not the "validity" of the code in
>> question.
>>
>>>
>>>
>>> If you know the datatype, you can really do
>>>
>>> def f(np.ndarray[double] arr):
>>>  c_func(&arr[0])
>>>
>>> which works with PEP 3118. But I use PyArray_DATA out of habit (and
>>> since it works in the cases without dtype).
>>>
>>> Frankly, I don't expect any Cython code to do the right thing here;
>>> calling PyArray_FromAny is much more typing. And really, nobody ever
>>> questioned that if we had an actual ndarray instance, we'd be allowed to
>>> call PyArray_DATA.
>>>
>>> I don't know how much Cython code is out there in the wild for which
>>> this is a problem. Either way, it would cause something of a reeducation
>>> challenge for Cython users.
>>>
>>>
>>> Since this style of coding already has known problems, do you think the
>>> case with NA-masks deserves more attention here? What will happen is
>>> access to array element data without consideration of the mask, which
>>> seems similar in nature to accessing array data with the wrong stride or
>>> byte order.
>>
>> I don't agree with the premise of that paragraph. There's no reason to
>> assume that just because code doesn't call FromAny, it has problems.
>> (And I'll continue to assume that whatever array is returned from
>> "np.ascontiguousarray is really contiguous...)
>>
>> Whether it requires attention or not is a different issue though. I'm
>> not sure. I think other people should weigh in on that -- I mostly write
>> code for my own consumption.
>>
>> One should at least check pandas, scikits-image, scikits-learn, mpi4py,
>> petsc4py, and so on. And ask on the Cython users list. Hopefully it will
>> usually be PEP 3118. But now I need to turn in.
>>
>> Travis, would such a survey be likely to affect the outcome of your
>> decision in any way? Or should we just leave this for now?
>>
>
> This dialog gets at the heart of the matter, I think.   The NEP seems to want 
> NumPy to have a "better" API that always protects downstream users from 
> understanding what is actually under the covers.   It would prefer to push 
> NumPy in the direction of an array object that is fundamentally more opaque.  
>  However, the world NumPy lives in is decidedly not opaque.   There has been 
> significant education and shared understanding of what a NumPy array actually 
> *is* (a strided view of memory of a particular "dtype").   This shared 
> understanding has even been pushed into Python as the buffer protocol. It 
> is very common for extension modules to go directly to the data they want by 
> using this understanding.
>
> This is very different from the traditional "shield your users from how 
> things are actually done" view of most object APIs. It was actually 
> intentional.  I'm not saying that different choices could not have been 
> made or that some amount of shielding should never be contemplated.   I'm 
> just saying that NumPy has been used as a nice bridge between the world of 
> scientific computing codes that have chunks of memory allocated for 
> processing and high-level code.   Part of the reason for this bridge has been 
> the simple object model.
>
> I just don't think the NEP fully 

Re: [Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Travis Oliphant
>> 
>> I guess this mixture of Python-API and C-API is different from the way
>> the API tries to protect incorrect access. From the Python API, it
>> should let everything through, because it's for Python code to use. From
>> the C API, it should default to not letting things through, because
>> special NA-mask aware code needs to be written. I'm not sure if there is
>> a reasonable approach here which works for everything.
> 
> Does that mean you consider changing ob_type for masked arrays 
> unreasonable? They can still use the same object struct...
> 
>> 
>>But in general, I will often be lazy and just do
>> 
>>def f(np.ndarray arr):
>> c_func(np.PyArray_DATA(arr))
>> 
>>It's an exception if you don't provide an array -- so who cares. (I
>>guess the odds of somebody feeding a masked array to code like that,
>>which doesn't try to be friendly, is relatively smaller though.)
>> 
>> 
>> This code would already fail with non-contiguous strides or byte-swapped
>> data, so the additional NA mask case seems to fit in an already-failing
>> category.
> 
> Honestly! I hope you didn't think I provided a full-fledged example? 
> Perhaps you'd like to point out to me that "c_func" is a bad name for a 
> function as well?
> 
> One would of course check that things are contiguous (or pass on the 
> strides), check the dtype and dispatch to different C functions in each 
> case, etc.
> 
> But that isn't the point. Scientific code most of the time does fall in 
> the "already-failing" category. That doesn't mean it doesn't count. 
> Let's focus on the number of code lines written and developer hours that 
> will be spent cleaning up the mess -- not the "validity" of the code in 
> question.
> 
>> 
>> 
>>If you know the datatype, you can really do
>> 
>>def f(np.ndarray[double] arr):
>> c_func(&arr[0])
>> 
>>which works with PEP 3118. But I use PyArray_DATA out of habit (and
>>since it works in the cases without dtype).
>> 
>>Frankly, I don't expect any Cython code to do the right thing here;
>>calling PyArray_FromAny is much more typing. And really, nobody ever
>>questioned that if we had an actual ndarray instance, we'd be allowed to
>>call PyArray_DATA.
>> 
>>I don't know how much Cython code is out there in the wild for which
>>this is a problem. Either way, it would cause something of a reeducation
>>challenge for Cython users.
>> 
>> 
>> Since this style of coding already has known problems, do you think the
>> case with NA-masks deserves more attention here? What will happen is
>> access to array element data without consideration of the mask, which
>> seems similar in nature to accessing array data with the wrong stride or
>> byte order.
> 
> I don't agree with the premise of that paragraph. There's no reason to 
> assume that just because code doesn't call FromAny, it has problems. 
> (And I'll continue to assume that whatever array is returned from 
> "np.ascontiguousarray is really contiguous...)
> 
> Whether it requires attention or not is a different issue though. I'm 
> not sure. I think other people should weigh in on that -- I mostly write 
> code for my own consumption.
> 
> One should at least check pandas, scikits-image, scikits-learn, mpi4py, 
> petsc4py, and so on. And ask on the Cython users list. Hopefully it will 
> usually be PEP 3118. But now I need to turn in.
> 
> Travis, would such a survey be likely to affect the outcome of your 
> decision in any way? Or should we just leave this for now?
> 

This dialog gets at the heart of the matter, I think.   The NEP seems to want 
NumPy to have a "better" API that always protects downstream users from 
understanding what is actually under the covers.   It would prefer to push 
NumPy in the direction of an array object that is fundamentally more opaque.   
However, the world NumPy lives in is decidedly not opaque.   There has been 
significant education and shared understanding of what a NumPy array actually 
*is* (a strided view of memory of a particular "dtype").   This shared 
understanding has even been pushed into Python as the buffer protocol. It is 
very common for extension modules to go directly to the data they want by using 
this understanding.
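
Concretely, "going directly to the data" through the buffer protocol looks
something like the following rough sketch (illustrative only; sum_doubles is a
made-up consumer and module boilerplate is omitted):

#include <Python.h>
#include <string.h>

/* Sketch: a consumer that goes straight to the raw memory via the buffer
   protocol.  It never asks whether the exporting object carries any extra
   semantics, such as an NA mask. */
static PyObject *
sum_doubles(PyObject *self, PyObject *obj)
{
    Py_buffer view;
    double total = 0.0;
    Py_ssize_t i, n;

    if (PyObject_GetBuffer(obj, &view, PyBUF_C_CONTIGUOUS | PyBUF_FORMAT) < 0)
        return NULL;
    if (view.format == NULL || strcmp(view.format, "d") != 0) {
        PyBuffer_Release(&view);
        PyErr_SetString(PyExc_TypeError, "expected a contiguous buffer of doubles");
        return NULL;
    }
    n = view.len / (Py_ssize_t)sizeof(double);
    for (i = 0; i < n; i++)
        total += ((double *)view.buf)[i];
    PyBuffer_Release(&view);
    return PyFloat_FromDouble(total);
}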

This is very different from the traditional "shield your users from how things 
are actually done" view of most object APIs. It was actually intentional. 
 I'm not saying that different choices could not have been made or that some 
amount of shielding should never be contemplated.   I'm just saying that NumPy 
has been used as a nice bridge between the world of scientific computing codes 
that have chunks of memory allocated for processing and high-level code.   Part 
of the reason for this bridge has been the simple object model.  

I just don't think the NEP fully appreciates just how fundamental of a shift 
this is in the wider NumPy community and it is not something that can be done 
immediately or wi

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 12:21 AM, Charles R Harris wrote:

> 
> 
> On Wed, May 9, 2012 at 11:05 PM, Benjamin Root  wrote:
> 
> 
> On Wednesday, May 9, 2012, Nathaniel Smith wrote:
> 
> 
> My only objection to this proposal is that committing to this approach
> seems premature. The existing masked array objects act quite
> differently from numpy.ma, so why do you believe that they're a good
> foundation for numpy.ma, and why will users want to switch to their
> semantics over numpy.ma's semantics? These aren't rhetorical
> questions, it seems like they must have concrete answers, but I don't
> know what they are.
> 
> Based on the design decisions made in the original NEP, a re-made numpy.ma 
> would have to lose _some_ features, particularly the ability to share masks. 
> Save for that and some very obscure behaviors that are undocumented, it is 
> possible to remake numpy.ma as a compatibility layer.
> 
> That being said, I think that there are some fundamental questions that have 
> concerned me. If I recall, there were unresolved questions about behaviors 
> surrounding assignments to elements of a view.
> 
> I see the project as broken down like this:
> 1.) internal architecture (largely abi issues)
> 2.) external architecture (hooks throughout numpy to utilize the new features 
> where possible such as where= argument)
> 3.) getter/setter semantics
> 4.) mathematical semantics
> 
> At this moment, I think we have pieces of 2 and they are fairly 
> non-controversial. It is 1 that I see as being the immediate hold-up here. 3 
> & 4 are non-trivial, but because they are mostly about interfaces, I think we 
> can be willing to accept some very basic, fundamental, barebones components 
> here in order to lay the groundwork for a more complete API later.
> 
> To talk of Travis's proposal, doing nothing is a no-go. Not moving forward 
> would dishearten the community. Making an ndmasked type is very intriguing. I 
> see it as a step towards eventually deprecating ndarray? Also, how would it 
> behave with np.asarray() and np.asanyarray()? My other concern is a possible 
> violation of DRY. How difficult would it be to maintain two ndarrays in 
> parallel?  
> 
> As for the flag approach, this still doesn't solve the problem of legacy code 
> (or did I misunderstand?)
> 
> My understanding of the flag is to allow the code to stay in and get reworked 
> and experimented with while keeping it from contaminating conventional use.
> 
> The whole point of putting the code in was to experiment and adjust. The 
> rather bizarre idea that it needs to be perfect from the get-go is 
> disheartening, and is seldom how new things get developed. Sure, there is a 
> plan up front, but there needs to be feedback and change. And in fact, I 
> haven't seen much feedback about the actual code; I don't even know that the 
> people complaining have tried using it to see where it hurts. I'd like that 
> sort of feedback.
> 

I don't think anyone is saying it needs to be perfect from the get-go. What 
I am saying is that this is fundamental enough to downstream users that this 
kind of thing is best done as a separate object.  The flag could still be used 
to make all Python-level array constructors build ndmasked objects.  

But this doesn't address the C-level story, where there is quite a bit of 
downstream use in which people have used the NumPy array as just a pointer to 
memory without considering that there might be a mask attached that should be 
inspected as well. 

The NEP addresses this a little bit for those C or C++ consumers of the ndarray 
who always use PyArray_FromAny, which can fail if the array has non-NULL 
mask contents.   However, it is *not* true that all downstream users use 
PyArray_FromAny. 

A large number of users just use something like PyArray_Check and then 
PyArray_DATA to get the pointer to the data buffer and then go from there 
thinking of their data as a strided memory chunk only (no extra mask). The 
NEP fundamentally changes this simple invariant that has been in NumPy and 
Numeric before it for a long, long time. 
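
A rough sketch of that pattern (illustrative only; legacy_consumer and its
checks are made up, and module boilerplate such as import_array() is omitted):

#include <Python.h>
#include <numpy/arrayobject.h>

/* Sketch of the "check the type, then grab the pointer" style: nothing here
   knows, or asks, whether an NA mask is attached to the array. */
static PyObject *
legacy_consumer(PyObject *self, PyObject *obj)
{
    PyArrayObject *arr;
    double *data;
    npy_intp i, n;
    double total = 0.0;

    if (!PyArray_Check(obj)) {
        PyErr_SetString(PyExc_TypeError, "expected an ndarray");
        return NULL;
    }
    arr = (PyArrayObject *)obj;
    if (PyArray_TYPE(arr) != NPY_DOUBLE ||
            !PyArray_ISCARRAY(arr) || !PyArray_ISNOTSWAPPED(arr)) {
        PyErr_SetString(PyExc_TypeError,
                        "expected a contiguous, native-byte-order float64 array");
        return NULL;
    }
    data = (double *)PyArray_DATA(arr);   /* raw pointer, mask never consulted */
    n = PyArray_SIZE(arr);
    for (i = 0; i < n; i++)
        total += data[i];
    return PyFloat_FromDouble(total);
}

Under the NEP, a function like that would silently compute over whatever values
sit behind the mask unless something upstream stops the masked array first.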

I really don't see how we can do this in a 1.7 release. It has too many 
unknown, and I think unknowable, downstream effects. But I think we could 
introduce another arrayobject that is the masked_array, with a Python-level flag 
that makes it the default array in Python. 

There are a few more subtleties.  PyArray_Check by default will pass 
sub-classes, so if the new ndmasked array were a sub-class then it would be passed 
(just like current numpy.ma arrays and matrices would pass that check today).   
 However, there is a PyArray_CheckExact macro which could be used to ensure the 
object is actually of PyArray_Type.   There is also the PyArg_ParseTuple 
command with "O!" that I have seen used many times to ensure an exact NumPy 
array.  
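
For illustration, the difference between the two macros is roughly this (a
sketch; classify is just a made-up helper):

#include <Python.h>
#include <numpy/arrayobject.h>

/* Sketch: PyArray_Check() passes ndarray sub-classes (numpy.ma arrays and
   matrices today, an ndmasked sub-class tomorrow), while PyArray_CheckExact()
   only accepts an object whose type is exactly PyArray_Type. */
static const char *
classify(PyObject *obj)
{
    if (PyArray_CheckExact(obj))
        return "base-class ndarray";
    if (PyArray_Check(obj))
        return "ndarray sub-class";
    return "not an ndarray";
}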

-Travis






> Chuck
> 

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 3:40 AM, Scott Sinclair wrote:

> On 9 May 2012 18:46, Travis Oliphant  wrote:
>> The document is available here:
>>https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
> 
> This is orthogonal to the discussion, but I'm curious as to why this
> discussion document has landed in the website repo?
> 
> I suppose it's not a really big deal, but future uploads of the
> website will now include a page at
> http://numpy.scipy.org/NA-overview.html with the content of this
> document. If that's desirable, I'll add a note at the top of the
> overview referencing this discussion thread. If not it can be
> relocated somewhere more desirable after this thread's discussion
> deadline expires.

Yes, it can be relocated.   Can you suggest where it should go?  It was added 
there so that Nathaniel and Mark could both edit it together, with Nathaniel 
added to the web-team. 

It may not be a bad place for it, though.   At least for a while. 

-Travis


> 
> Cheers,
> Scott


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Matthew Brett
Hi,

On Thu, May 10, 2012 at 2:43 AM, Nathaniel Smith  wrote:
> Hi Matthew,
>
> On Thu, May 10, 2012 at 12:01 AM, Matthew Brett  
> wrote:
>>> The third proposal is certainly the best one from Cython's perspective;
>>> and I imagine for those writing C extensions against the C API too.
>>> Having PyType_Check fail for ndmasked is a very good way of having code
>>> fail that is not written to take masks into account.
>>
>> Mark, Nathaniel - can you comment how your chosen approaches would
>> interact with extension code?
>>
>> I'm guessing the bitpattern dtypes would be expected to cause
>> extension code to choke if the type is not supported?
>
> That's pretty much how I'm imagining it, yes. Right now if you have,
> say, a Cython function like
>
> cdef f(np.ndarray[double] a):
>    ...
>
> and you do f(np.zeros(10, dtype=int)), then it will error out, because
> that function doesn't know how to handle ints, only doubles. The same
> would apply for, say, a NA-enabled integer. In general there are
> almost arbitrarily many dtypes that could get passed into any function
> (including user-defined ones, etc.), so C code already has to check
> dtypes for correctness.
>
> Second order issues:
> - There is certainly C code out there that just assumes that it will
> only be passed an array with certain dtype (and ndim, memory layout,
> etc...). If you write such C code then it's your job to make sure that
> you only pass it the kinds of arrays that it expects, just like now
> :-).
>
> - We may want to do some sort of special-casing of handling for
> floating point NA dtypes that use an NaN as the "magic" bitpattern,
> since many algorithms *will* work with these unchanged, and it might
> be frustrating to have to wait for every extension module to be
> updated just to allow for this case explicitly before using them. OTOH
> you can easily work around this. Like say my_qr is a legacy C function
> that will in fact propagate NaNs correctly, so float NA dtypes would
> Just Work -- except, it errors out at the start because it doesn't
> recognize the dtype. How annoying. We *could* have some special hack
> you can use to force it to work anyway (by like making the "is this
> the dtype I expect?" routine lie.) But you can also just do:
>
>  def my_qr_wrapper(arr):
>    if arr.dtype is a NA float dtype with NaN magic value:
>      result = my_qr(arr.view(arr.dtype.base_dtype))
>      return result.view(arr.dtype)
>    else:
>      return my_qr(arr)
>
> and hey presto, now it will correctly pass through NAs. So perhaps
> it's not worth bothering with special hacks.
>
> - Of course if  your extension function does want to handle NAs
> generically, then there will be a simple C api for checking for them,
> setting them, etc. Numpy needs such an API internally anyway!

Thanks for this.

Mark - in view of the discussions about Cython and extension code -
could you say what you see as disadvantages to the ndmasked subclass
proposal?

Cheers,

Matthew


Re: [Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Dag Sverre Seljebotn


Dag Sverre Seljebotn  wrote:

>On 05/11/2012 01:06 AM, Mark Wiebe wrote:
>> On Thu, May 10, 2012 at 5:47 PM, Dag Sverre Seljebotn
>> mailto:d.s.seljeb...@astro.uio.no>>
>wrote:
>>
>> On 05/11/2012 12:28 AM, Mark Wiebe wrote:
>>  > I did some searching for typical Cython and C code which
>accesses
>> numpy
>>  > arrays, and added a section to the NEP describing how they
>behave
>> in the
>>  > current implementation. Cython code which uses either straight
>Python
>>  > access or the buffer protocol is fine (after a bugfix in
>numpy, it
>>  > wasn't failing currently as it should in the pep3118 case). C
>> code which
>>  > follows the recommended practice of using PyArray_FromAny or
>one
>> of the
>>  > related macros is also fine, because these functions have been
>> made to
>>  > fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is
>> provided.
>>  >
>>  > In general, code which follows the recommended numpy practices
>will
>>  > raise exceptions when encountering NA-masked arrays. This
>means
>>  > programmers don't have to worry about the NA unless they want
>to
>> support
>>  > it. Having things go through PyArray_FromAny also provides a
>> place where
>>  > lazy evaluation arrays could be evaluated, and other similar
>> potential
>>  > future extensions can use to provide compatibility.
>>  >
>>  > Here's the section I added to the NEP:
>>  >
>>  > Interaction With Pre-existing C API Usage
>>  > =
>>  >
>>  > Making sure existing code using the C API, whether it's
>written
>> in C, C++,
>>  > or Cython, does something reasonable is an important goal of
>this
>>  > implementation.
>>  > The general strategy is to make existing code which does not
>> explicitly
>>  > tell numpy it supports NA masks fail with an exception saying
>so.
>> There are
>>  > a few different access patterns people use to get ahold of the
>numpy
>>  > array data,
>>  > here we examine a few of them to see what numpy can do. These
>> examples are
>>  > found from doing google searches of numpy C API array access.
>>  >
>>  > Numpy Documentation - How to extend NumPy
>>  > -
>>  >
>>  >
>>
>http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>>  >
>>  > This page has a section "Dealing with array objects" which has
>some
>>  > advice for how
>>  > to access numpy arrays from C. When accepting arrays, the
>first
>> step it
>>  > suggests is
>>  > to use PyArray_FromAny or a macro built on that function, so
>code
>>  > following this
>>  > advice will properly fail when given an NA-masked array it
>> doesn't know
>>  > how to handle.
>>  >
>>  > The way this is handled is that PyArray_FromAny requires a
>> special flag,
>>  > NPY_ARRAY_ALLOWNA,
>>  > before it will allow NA-masked arrays to flow through.
>>  >
>>  >
>>
>http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>>  >
>>  > Code which does not follow this advice, and instead just calls
>>  > PyArray_Check() to verify
>>  > it's an ndarray and checks some flags, will silently produce
>incorrect
>>  > results. This style
>>  > of code does not provide any opportunity for numpy to say
>"hey, this
>>  > array is special",
>>  > so also is not compatible with future ideas of lazy
>evaluation,
>> derived
>>  > dtypes, etc.
>>
>> This doesn't really cover the Cython code I write that interfaces
>with C
>> (and probably the code others write in Cython).
>>
>> Often I'd do:
>>
>> def f(arg):
>>  cdef np.ndarray arr = np.asarray(arg)
>>  c_func(np.PyArray_DATA(arr))
>>
>> So I mix Python np.asarray with C PyArray_DATA. In general, I
>think you
>> use PyArray_FromAny if you're very concerned about performance or
>need
>> some special flag, but it's certainly not the first thing you
>try.
>>
>>
>> I guess this mixture of Python-API and C-API is different from the
>way
>> the API tries to protect incorrect access. From the Python API, it
>> should let everything through, because it's for Python code to use.
>From
>> the C API, it should default to not letting things through, because
>> special NA-mask aware code needs to be written. I'm not sure if there
>is
>> a reasonable approach here which works for everything.
>
>Does that mean you consider changing ob_type for masked arrays 
>unreasonable? They can still use the same object struct...
>
>>
>> But in general, I will often be lazy and just do
>>
>> def f(np.ndarray arr):
>>  c_func(np.PyArray_DATA(arr))
>>
>> It's an exception if you don't provide an array -- so who cares.
>(I
>> guess the odds of somebody

Re: [Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Dag Sverre Seljebotn
On 05/11/2012 01:06 AM, Mark Wiebe wrote:
> On Thu, May 10, 2012 at 5:47 PM, Dag Sverre Seljebotn
> mailto:d.s.seljeb...@astro.uio.no>> wrote:
>
> On 05/11/2012 12:28 AM, Mark Wiebe wrote:
>  > I did some searching for typical Cython and C code which accesses
> numpy
>  > arrays, and added a section to the NEP describing how they behave
> in the
>  > current implementation. Cython code which uses either straight Python
>  > access or the buffer protocol is fine (after a bugfix in numpy, it
>  > wasn't failing currently as it should in the pep3118 case). C
> code which
>  > follows the recommended practice of using PyArray_FromAny or one
> of the
>  > related macros is also fine, because these functions have been
> made to
>  > fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is
> provided.
>  >
>  > In general, code which follows the recommended numpy practices will
>  > raise exceptions when encountering NA-masked arrays. This means
>  > programmers don't have to worry about the NA unless they want to
> support
>  > it. Having things go through PyArray_FromAny also provides a
> place where
>  > lazy evaluation arrays could be evaluated, and other similar
> potential
>  > future extensions can use to provide compatibility.
>  >
>  > Here's the section I added to the NEP:
>  >
>  > Interaction With Pre-existing C API Usage
>  > =
>  >
>  > Making sure existing code using the C API, whether it's written
> in C, C++,
>  > or Cython, does something reasonable is an important goal of this
>  > implementation.
>  > The general strategy is to make existing code which does not
> explicitly
>  > tell numpy it supports NA masks fail with an exception saying so.
> There are
>  > a few different access patterns people use to get ahold of the numpy
>  > array data,
>  > here we examine a few of them to see what numpy can do. These
> examples are
>  > found from doing google searches of numpy C API array access.
>  >
>  > Numpy Documentation - How to extend NumPy
>  > -
>  >
>  >
> 
> http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>  >
>  > This page has a section "Dealing with array objects" which has some
>  > advice for how
>  > to access numpy arrays from C. When accepting arrays, the first
> step it
>  > suggests is
>  > to use PyArray_FromAny or a macro built on that function, so code
>  > following this
>  > advice will properly fail when given an NA-masked array it
> doesn't know
>  > how to handle.
>  >
>  > The way this is handled is that PyArray_FromAny requires a
> special flag,
>  > NPY_ARRAY_ALLOWNA,
>  > before it will allow NA-masked arrays to flow through.
>  >
>  >
> 
> http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>  >
>  > Code which does not follow this advice, and instead just calls
>  > PyArray_Check() to verify
>  > it's an ndarray and checks some flags, will silently produce incorrect
>  > results. This style
>  > of code does not provide any opportunity for numpy to say "hey, this
>  > array is special",
>  > so also is not compatible with future ideas of lazy evaluation,
> derived
>  > dtypes, etc.
>
> This doesn't really cover the Cython code I write that interfaces with C
> (and probably the code others write in Cython).
>
> Often I'd do:
>
> def f(arg):
>  cdef np.ndarray arr = np.asarray(arg)
>  c_func(np.PyArray_DATA(arr))
>
> So I mix Python np.asarray with C PyArray_DATA. In general, I think you
> use PyArray_FromAny if you're very concerned about performance or need
> some special flag, but it's certainly not the first thing you try.
>
>
> I guess this mixture of Python-API and C-API is different from the way
> the API tries to protect incorrect access. From the Python API, it
> should let everything through, because it's for Python code to use. From
> the C API, it should default to not letting things through, because
> special NA-mask aware code needs to be written. I'm not sure if there is
> a reasonable approach here which works for everything.

Does that mean you consider changing ob_type for masked arrays 
unreasonable? They can still use the same object struct...

>
> But in general, I will often be lazy and just do
>
> def f(np.ndarray arr):
>  c_func(np.PyArray_DATA(arr))
>
> It's an exception if you don't provide an array -- so who cares. (I
> guess the odds of somebody feeding a masked array to code like that,
> which doesn't try to be friendly, is relatively smaller though.)
>
>
> This code would already fail with non-contiguous st

Re: [Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Mark Wiebe
On Thu, May 10, 2012 at 5:47 PM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> On 05/11/2012 12:28 AM, Mark Wiebe wrote:
> > I did some searching for typical Cython and C code which accesses numpy
> > arrays, and added a section to the NEP describing how they behave in the
> > current implementation. Cython code which uses either straight Python
> > access or the buffer protocol is fine (after a bugfix in numpy, it
> > wasn't failing currently as it should in the pep3118 case). C code which
> > follows the recommended practice of using PyArray_FromAny or one of the
> > related macros is also fine, because these functions have been made to
> > fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is provided.
> >
> > In general, code which follows the recommended numpy practices will
> > raise exceptions when encountering NA-masked arrays. This means
> > programmers don't have to worry about the NA unless they want to support
> > it. Having things go through PyArray_FromAny also provides a place where
> > lazy evaluation arrays could be evaluated, and other similar potential
> > future extensions can use to provide compatibility.
> >
> > Here's the section I added to the NEP:
> >
> > Interaction With Pre-existing C API Usage
> > =
> >
> > Making sure existing code using the C API, whether it's written in C,
> C++,
> > or Cython, does something reasonable is an important goal of this
> > implementation.
> > The general strategy is to make existing code which does not explicitly
> > tell numpy it supports NA masks fail with an exception saying so. There
> are
> > a few different access patterns people use to get ahold of the numpy
> > array data,
> > here we examine a few of them to see what numpy can do. These examples
> are
> > found from doing google searches of numpy C API array access.
> >
> > Numpy Documentation - How to extend NumPy
> > -
> >
> >
> http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
> >
> > This page has a section "Dealing with array objects" which has some
> > advice for how
> > to access numpy arrays from C. When accepting arrays, the first step it
> > suggests is
> > to use PyArray_FromAny or a macro built on that function, so code
> > following this
> > advice will properly fail when given an NA-masked array it doesn't know
> > how to handle.
> >
> > The way this is handled is that PyArray_FromAny requires a special flag,
> > NPY_ARRAY_ALLOWNA,
> > before it will allow NA-masked arrays to flow through.
> >
> >
> http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
> >
> > Code which does not follow this advice, and instead just calls
> > PyArray_Check() to verify
> > it's an ndarray and checks some flags, will silently produce incorrect
> > results. This style
> > of code does not provide any opportunity for numpy to say "hey, this
> > array is special",
> > so also is not compatible with future ideas of lazy evaluation, derived
> > dtypes, etc.
>
> This doesn't really cover the Cython code I write that interfaces with C
> (and probably the code others write in Cython).
>
> Often I'd do:
>
> def f(arg):
> cdef np.ndarray arr = np.asarray(arg)
> c_func(np.PyArray_DATA(arr))
>
> So I mix Python np.asarray with C PyArray_DATA. In general, I think you
> use PyArray_FromAny if you're very concerned about performance or need
> some special flag, but it's certainly not the first thing you try.
>

I guess this mixture of Python-API and C-API is different from the way the
API tries to protect incorrect access. From the Python API, it should let
everything through, because it's for Python code to use. From the C API, it
should default to not letting things through, because special NA-mask aware
code needs to be written. I'm not sure if there is a reasonable approach
here which works for everything.


> But in general, I will often be lazy and just do
>
> def f(np.ndarray arr):
> c_func(np.PyArray_DATA(arr))
>
> It's an exception if you don't provide an array -- so who cares. (I
> guess the odds of somebody feeding a masked array to code like that,
> which doesn't try to be friendly, is relatively smaller though.)
>

This code would already fail with non-contiguous strides or byte-swapped
data, so the additional NA mask case seems to fit in an already-failing
category.
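
(For reference, the usual way such code sidesteps the stride and byte-order
cases is to normalize the input through PyArray_FromAny or one of the macros
built on it -- a rough sketch with made-up names, which under the NEP is also
where an NA-masked array would be stopped with an exception:)

#include <Python.h>
#include <numpy/arrayobject.h>

/* Sketch: PyArray_FROM_OTF is built on PyArray_FromAny; it hands back an
   aligned, native-byte-order, C-contiguous float64 array (converting or
   copying as needed), so the raw-pointer loop below is safe for the
   stride/byte-order cases mentioned above. */
static PyObject *
normalized_consumer(PyObject *self, PyObject *obj)
{
    PyArrayObject *arr;
    double *data, total = 0.0;
    npy_intp i, n;

    arr = (PyArrayObject *)PyArray_FROM_OTF(obj, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    if (arr == NULL)
        return NULL;
    data = (double *)PyArray_DATA(arr);
    n = PyArray_SIZE(arr);
    for (i = 0; i < n; i++)
        total += data[i];
    Py_DECREF(arr);
    return PyFloat_FromDouble(total);
}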


>
> If you know the datatype, you can really do
>
> def f(np.ndarray[double] arr):
> c_func(&arr[0])
>
> which works with PEP 3118. But I use PyArray_DATA out of habit (and
> since it works in the cases without dtype).
>
> Frankly, I don't expect any Cython code to do the right thing here;
> calling PyArray_FromAny is much more typing. And really, nobody ever
> questioned that if we had an actual ndarray instance, we'd be allowed to
> call PyArray_DATA.
>
> I don't know how much Cython code is out there in the wild for which
> this is a problem. Eithe

Re: [Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Dag Sverre Seljebotn
On 05/11/2012 12:47 AM, Dag Sverre Seljebotn wrote:
> On 05/11/2012 12:28 AM, Mark Wiebe wrote:
>> I did some searching for typical Cython and C code which accesses numpy
>> arrays, and added a section to the NEP describing how they behave in the
>> current implementation. Cython code which uses either straight Python
>> access or the buffer protocol is fine (after a bugfix in numpy, it
>> wasn't failing currently as it should in the pep3118 case). C code which
>> follows the recommended practice of using PyArray_FromAny or one of the
>> related macros is also fine, because these functions have been made to
>> fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is provided.
>>
>> In general, code which follows the recommended numpy practices will
>> raise exceptions when encountering NA-masked arrays. This means
>> programmers don't have to worry about the NA unless they want to support
>> it. Having things go through PyArray_FromAny also provides a place where
>> lazy evaluation arrays could be evaluated, and other similar potential
>> future extensions can use to provide compatibility.
>>
>> Here's the section I added to the NEP:
>>
>> Interaction With Pre-existing C API Usage
>> =
>>
>> Making sure existing code using the C API, whether it's written in C, C++,
>> or Cython, does something reasonable is an important goal of this
>> implementation.
>> The general strategy is to make existing code which does not explicitly
>> tell numpy it supports NA masks fail with an exception saying so. There are
>> a few different access patterns people use to get ahold of the numpy
>> array data,
>> here we examine a few of them to see what numpy can do. These examples are
>> found from doing google searches of numpy C API array access.
>>
>> Numpy Documentation - How to extend NumPy
>> -
>>
>> http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>>
>> This page has a section "Dealing with array objects" which has some
>> advice for how
>> to access numpy arrays from C. When accepting arrays, the first step it
>> suggests is
>> to use PyArray_FromAny or a macro built on that function, so code
>> following this
>> advice will properly fail when given an NA-masked array it doesn't know
>> how to handle.
>>
>> The way this is handled is that PyArray_FromAny requires a special flag,
>> NPY_ARRAY_ALLOWNA,
>> before it will allow NA-masked arrays to flow through.
>>
>> http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>>
>> Code which does not follow this advice, and instead just calls
>> PyArray_Check() to verify
>> it's an ndarray and checks some flags, will silently produce incorrect
>> results. This style
>> of code does not provide any opportunity for numpy to say "hey, this
>> array is special",
>> so also is not compatible with future ideas of lazy evaluation, derived
>> dtypes, etc.
>
> This doesn't really cover the Cython code I write that interfaces with C
> (and probably the code others write in Cython).
>
> Often I'd do:
>
> def f(arg):
>   cdef np.ndarray arr = np.asarray(arg)
>   c_func(np.PyArray_DATA(arr))
>
> So I mix Python np.asarray with C PyArray_DATA. In general, I think you
> use PyArray_FromAny if you're very concerned about performance or need
> some special flag, but it's certainly not the first thing you try.
>
> But in general, I will often be lazy and just do
>
> def f(np.ndarray arr):
>   c_func(np.PyArray_DATA(arr))
>
> It's an exception if you don't provide an array -- so who cares. (I
> guess the odds of somebody feeding a masked array to code like that,
> which doesn't try to be friendly, is relatively smaller though.)
>
> If you know the datatype, you can really do
>
> def f(np.ndarray[double] arr):
>   c_func(&arr[0])
>
> which works with PEP 3118. But I use PyArray_DATA out of habit (and
> since it works in the cases without dtype).
>
> Frankly, I don't expect any Cython code to do the right thing here;
> calling PyArray_FromAny is much more typing. And really, nobody ever
> questioned that if we had an actual ndarray instance, we'd be allowed to
> call PyArray_DATA.
>
> I don't know how much Cython code is out there in the wild for which
> this is a problem. Either way, it would cause something of a reeducation
> challenge for Cython users.

Also note that Cython users are in the habit of accessing "arr.data" 
(which is the char*, not the buffer object) directly. Just in case you 
had the idea of grepping for PyArray_DATA in Cython code.

Our plan there is we'll eventually put out a Cython version which 
special-cases np.ndarray and turn ".data" into a call to PyArray_DATA 
(and same for shape, strides, ...). Ugly hack, but avoids breaking 
existing Cython code if NumPy removes the field access.

Dag


>
> Dag
>
>>
>> Tutorial From Cython Website
>> 
>>
>> http://docs.cython.org/src/tutorial/nu

Re: [Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Dag Sverre Seljebotn
On 05/11/2012 12:28 AM, Mark Wiebe wrote:
> I did some searching for typical Cython and C code which accesses numpy
> arrays, and added a section to the NEP describing how they behave in the
> current implementation. Cython code which uses either straight Python
> access or the buffer protocol is fine (after a bugfix in numpy, it
> wasn't failing currently as it should in the pep3118 case). C code which
> follows the recommended practice of using PyArray_FromAny or one of the
> related macros is also fine, because these functions have been made to
> fail on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is provided.
>
> In general, code which follows the recommended numpy practices will
> raise exceptions when encountering NA-masked arrays. This means
> programmers don't have to worry about the NA unless they want to support
> it. Having things go through PyArray_FromAny also provides a place where
> lazy evaluation arrays could be evaluated, and other similar potential
> future extensions can use to provide compatibility.
>
> Here's the section I added to the NEP:
>
> Interaction With Pre-existing C API Usage
> =
>
> Making sure existing code using the C API, whether it's written in C, C++,
> or Cython, does something reasonable is an important goal of this
> implementation.
> The general strategy is to make existing code which does not explicitly
> tell numpy it supports NA masks fail with an exception saying so. There are
> a few different access patterns people use to get ahold of the numpy
> array data,
> here we examine a few of them to see what numpy can do. These examples are
> found from doing google searches of numpy C API array access.
>
> Numpy Documentation - How to extend NumPy
> -
>
> http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects
>
> This page has a section "Dealing with array objects" which has some
> advice for how
> to access numpy arrays from C. When accepting arrays, the first step it
> suggests is
> to use PyArray_FromAny or a macro built on that function, so code
> following this
> advice will properly fail when given an NA-masked array it doesn't know
> how to handle.
>
> The way this is handled is that PyArray_FromAny requires a special flag,
> NPY_ARRAY_ALLOWNA,
> before it will allow NA-masked arrays to flow through.
>
> http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA
>
> Code which does not follow this advice, and instead just calls
> PyArray_Check() to verify
> it's an ndarray and checks some flags, will silently produce incorrect
> results. This style
> of code does not provide any opportunity for numpy to say "hey, this
> array is special",
> so also is not compatible with future ideas of lazy evaluation, derived
> dtypes, etc.

This doesn't really cover the Cython code I write that interfaces with C 
(and probably the code others write in Cython).

Often I'd do:

def f(arg):
 cdef np.ndarray arr = np.asarray(arg)
 c_func(np.PyArray_DATA(arr))

So I mix Python np.asarray with C PyArray_DATA. In general, I think you 
use PyArray_FromAny if you're very concerned about performance or need 
some special flag, but it's certainly not the first thing you try.

But in general, I will often be lazy and just do

def f(np.ndarray arr):
 c_func(np.PyArray_DATA(arr))

It's an exception if you don't provide an array -- so who cares. (I 
guess the odds of somebody feeding a masked array to code like that, 
which doesn't try to be friendly, is relatively smaller though.)

If you know the datatype, you can really do

def f(np.ndarray[double] arr):
 c_func(&arr[0])

which works with PEP 3118. But I use PyArray_DATA out of habit (and 
since it works in the cases without dtype).

Frankly, I don't expect any Cython code to do the right thing here; 
calling PyArray_FromAny is much more typing. And really, nobody ever 
questioned that if we had an actual ndarray instance, we'd be allowed to 
call PyArray_DATA.

I don't know how much Cython code is out there in the wild for which 
this is a problem. Either way, it would cause something of a reeducation 
challenge for Cython users.

Dag

>
> Tutorial From Cython Website
> 
>
> http://docs.cython.org/src/tutorial/numpy.html
>
> This tutorial gives a convolution example, and all the examples fail with
> Python exceptions when given inputs that contain NA values.
>
> Before any Cython type annotation is introduced, the code functions just
> as equivalent Python would in the interpreter.
>
> When the type information is introduced, it is done via numpy.pxd which
> defines a mapping between an ndarray declaration and PyArrayObject \*.
> Under the hood, this maps to __Pyx_ArgTypeTest, which does a direct
> comparison of Py_TYPE(obj) against the PyTypeObject for the ndarray.
>
> Then the code does some dtype comparisons, and uses regular python indexing
> to ac

Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Mark Wiebe
On Thu, May 10, 2012 at 5:27 PM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> On 05/10/2012 08:23 PM, Chris Barker wrote:
> > On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
> >   wrote:
> >> What would serve me? I use NumPy as a glorified "double*".
> >
> >> all I want is my glorified
> >> "double*". I'm probably not a representative user.)
> >
> > Actually, I think you are representative of a LOT of users -- it
> > turns out, whether Jim Hugunin originally was thinking this way or
> > not, but numpy arrays are really powerful because they provide BOTH a
> > nifty, full-featured array object in Python, AND a wrapper around a
> > generic "double*" (actually char*, that could be any type).
> >
> > This is a really widely used feature, and has become even more so
> > with Cython's numpy support.
> >
> > That is one of my concerns about the "bit pattern" idea -- we've then
> > created a new binary type that no other standard software understands
> > -- that looks like a lot of work to me to deal with, or even worse,
> > ripe for weird, non-obvious errors in code that accesses that good-old
> > char*.
> >
> > So I'm happier with a mask implementation -- more memory, yes, but it
> > seems more robust and easy to deal with from outside code.
>
> It's very interesting that you consider masks easier to integrate with
> C/C++ code than bitpatterns. I guess everybody's experience (and every
> C/C++/Fortran code base) is different.
>
> >
> > But either way, Dag's key point is right on -- in Cython (or any other
> > code) -- we need to make sure it's easy to get a regular old pointer
> > to a regular old C array, and not get something else by accident.
>
> I'm sorry if I caused any confusion -- I didn't mean to suggest that
> anybody would ever remove the ability of getting a pointer to an
> unmasked array.
>
> There is a problem that's being discussed of the opposite nature:
>
> With masked arrays, the current situation in NumPy trunk is that if
> you're presented with a masked array, and do not explicitly check for a
> mask (i.e., all existing code), you'll transparently and without warning
> "unmask" it -- that is, an element has the last value before NA was
> assigned. This is the case whether you use PEP 3118 (np.ndarray[double]
> or double[:]), or PyArray_DATA.
>
> According to the NEP, you should really get an exception when accessing
> through PEP 3118, but this seems to not be implemented. I don't know
> whether this was a conscious change or a lack of implementation (?).
>

This was an error; I've made a pull request to fix it.


> PyArray_DATA will continue to transparently unmask data. However, with
> Travis' proposal of making a new 'ndmasked' type, old code will be
> protected; it will raise an exception for masked arrays instead of
> transparently unmasking, giving the user a chance to work around it (or
> update the code to work with masks).
>

In searching for example code, the examples I found and the numpy
documentation recommend using PyArray_FromAny or related functions to
sanitize the array before use. This provides a place to stop NA-masked
arrays and raise an exception. Is there a lot of code out there which isn't
following this practice?

Cheers,
Mark


> Regarding new code that you write to be mask-aware, fear not -- you can
> use PyArray_DATA and PyArray_MASKNA_DATA to get the pointers. You can't
> really access the mask using np.ndarray[uint8] or uint8[:], but it
> wouldn't be a problem for NumPy to provide such access for Cython users.
>
> Regarding native Cython support for masks, bitpatterns would be a quick
> job and an uncontroversial feature, we just need to agree on an
> extension to the PEP 3118 format string with NumPy and then it takes a
> few hours to implement it. Masks would require quite some hashing out on
> the Cython email list to figure out whether and how we would want to
> support it, and is quite some more development work as well. How we'd
> even do that is much more vague to me.
>
> Dag
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] NA-mask interactions with existing C code

2012-05-10 Thread Mark Wiebe
I did some searching for typical Cython and C code which accesses numpy
arrays, and added a section to the NEP describing how such code behaves in
the current implementation. Cython code which uses either straight Python
access or the buffer protocol is fine (after a bugfix in numpy -- it
currently wasn't failing in the PEP 3118 case as it should). C code which
follows the recommended practice of using PyArray_FromAny or one of the
related macros is also fine, because these functions have been made to fail
on NA-masked arrays unless the flag NPY_ARRAY_ALLOWNA is provided.

In general, code which follows the recommended numpy practices will raise
exceptions when encountering NA-masked arrays. This means programmers don't
have to worry about NA unless they want to support it. Having things go
through PyArray_FromAny also provides a place where lazy-evaluation arrays
could be evaluated, and a hook that other similar potential future
extensions can use to provide compatibility.

Here's the section I added to the NEP:

Interaction With Pre-existing C API Usage
=========================================

Making sure existing code using the C API, whether it's written in C, C++,
or Cython, does something reasonable is an important goal of this
implementation. The general strategy is to make existing code which does not
explicitly tell numpy it supports NA masks fail with an exception saying so.
There are a few different access patterns people use to get hold of the
numpy array data; here we examine a few of them to see what numpy can do.
These examples were found by doing Google searches for numpy C API array
access.

Numpy Documentation - How to extend NumPy
-----------------------------------------

http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html#dealing-with-array-objects

This page has a section "Dealing with array objects" which has some advice
for how to access numpy arrays from C. When accepting arrays, the first step
it suggests is to use PyArray_FromAny or a macro built on that function, so
code following this advice will properly fail when given an NA-masked array
it doesn't know how to handle.

The way this is handled is that PyArray_FromAny requires a special flag,
NPY_ARRAY_ALLOWNA, before it will allow NA-masked arrays to flow through.

http://docs.scipy.org/doc/numpy/reference/c-api.array.html#NPY_ARRAY_ALLOWNA

Code which does not follow this advice, and instead just calls
PyArray_Check() to verify it's an ndarray and checks some flags, will
silently produce incorrect results. This style of code does not provide any
opportunity for numpy to say "hey, this array is special", so it is also not
compatible with future ideas of lazy evaluation, derived dtypes, etc.

Tutorial From Cython Website
----------------------------

http://docs.cython.org/src/tutorial/numpy.html

This tutorial gives a convolution example, and all the examples fail with
Python exceptions when given inputs that contain NA values.

Before any Cython type annotation is introduced, the code functions just
as equivalent Python would in the interpreter.

When the type information is introduced, it is done via numpy.pxd which
defines a mapping between an ndarray declaration and PyArrayObject \*.
Under the hood, this maps to __Pyx_ArgTypeTest, which does a direct
comparison of Py_TYPE(obj) against the PyTypeObject for the ndarray.

Then the code does some dtype comparisons, and uses regular python indexing
to access the array elements. This python indexing still goes through the
Python API, so the NA handling and error checking in numpy still can work
like normal and fail if the inputs have NAs which cannot fit in the output
array. In this case it fails when trying to convert the NA into an integer
to set in the output.

The next version of the code introduces more efficient indexing. This
operates based on Python's buffer protocol. This causes Cython to call
__Pyx_GetBufferAndValidate, which calls __Pyx_GetBuffer, which calls
PyObject_GetBuffer. This call gives numpy the opportunity to raise an
exception if the inputs are arrays with NA-masks, something not supported
by the Python buffer protocol.

Numerical Python - JPL website
--

http://dsnra.jpl.nasa.gov/software/Python/numpydoc/numpy-13.html

This document is from 2001, so does not reflect recent numpy, but it is the
second hit when searching for "numpy c api example" on Google.

Its first example, under the heading "A simple example", is in fact already
invalid for recent numpy even without the NA support. In particular, if the
data is misaligned or in a different byteorder, it may crash or produce
incorrect results.

The next thing the document does is introduce PyArray_ContiguousFromObject,
which gives numpy an opportunity to raise an exception when NA-masked arrays
are used, so the later code will raise exceptions as desired.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/l

Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 08:23 PM, Chris Barker wrote:
> On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>   wrote:
>> What would serve me? I use NumPy as a glorified "double*".
>
>> all I want is my glorified
>> "double*". I'm probably not a representative user.)
>
> Actually, I think you are representative of a LOT of users -- it
> turns, out, whether Jim Huginin originally was thinking this way or
> not, but numpy arrays are really powerful because the provide BOTH and
> nifty, full featured array object in Python, AND a wrapper around a
> generic "double*" (actually char*, that could be any type).
>
> This is are really widely used feature, and has become even more so
> with Cython's numpy support.
>
> That is one of my concerns about the "bit pattern" idea -- we've then
> created a new binary type that no other standard software understands
> -- that looks like a a lot of work to me to deal with, or even worse,
> ripe for weird, non-obvious errors in code that access that good-old
> char*.
>
> So I'm happier with a mask implementation -- more memory, yes, but it
> seems more robust an easy to deal with with outside code.

It's very interesting that you consider masks easier to integrate with 
C/C++ code than bitpatterns. I guess everybody's experience (and every 
C/C++/Fortran code base) is different.

>
> But either way, Dag's key point is right on -- in Cython (or any other
> code) -- we need to make sure ti's easy to get a regular old pointer
> to a regular old C array, and get something else by accident.

I'm sorry if I caused any confusion -- I didn't mean to suggest that 
anybody would ever remove the ability of getting a pointer to an 
unmasked array.

There is a problem that's being discussed of the opposite nature:

With masked arrays, the current situation in NumPy trunk is that if 
you're presented with a masked array, and do not explicitly check for a 
mask (i.e., all existing code), you'll transparently and without warning 
"unmask" it -- that is, an element has the last value before NA was 
assigned. This is the case whether you use PEP 3118 (np.ndarray[double] 
or double[:]), or PyArray_DATA.

According to the NEP, you should really get an exception when accessing 
through PEP 3118, but this seems to not be implemented. I don't know 
whether this was a conscious change or a lack of implementation (?).

PyArray_DATA will continue to transparently unmask data. However, with 
Travis' proposal of making a new 'ndmasked' type, old code will be 
protected; it will raise an exception for masked arrays instead of 
transparently unmasking, giving the user a chance to work around it (or 
update the code to work with masks).

Regarding new code that you write to be mask-aware, fear not -- you can 
use PyArray_DATA and PyArray_MASKNA_DATA to get the pointers. You can't 
really access the mask using np.ndarray[uint8] or uint8[:], but it 
wouldn't be a problem for NumPy to provide such access for Cython users.

Regarding native Cython support for masks, bitpatterns would be a quick 
job and an uncontroversial feature, we just need to agree on an 
extension to the PEP 3118 format string with NumPy and then it takes a 
few hours to implement it. Masks would require quite some hashing out on 
the Cython email list to figure out whether and how we would want to 
support it, and is quite some more development work as well. How we'd 
even do that is much more vague to me.

Dag
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: NumPy 1.6.2 release candidate 1

2012-05-10 Thread Frédéric Bastien
Just to report that all tests pass here on Fedora 14, and all Theano tests
pass with this rc.

thanks

Fred

On Wed, May 9, 2012 at 11:05 PM, Charles R Harris
 wrote:
>
>
> On Wed, May 9, 2012 at 12:40 PM, Sandro Tosi  wrote:
>>
>> On Sat, May 5, 2012 at 8:15 PM, Ralf Gommers
>>  wrote:
>> > Please test this release and report any issues on the numpy-discussion
>> > mailing list.
>>
>> I think it's probably nice not to ship pyc in the source tarball:
>>
>> $ find numpy-1.6.2rc1/ -name "*.pyc"
>> numpy-1.6.2rc1/doc/sphinxext/docscrape.pyc
>> numpy-1.6.2rc1/doc/sphinxext/docscrape_sphinx.pyc
>> numpy-1.6.2rc1/doc/sphinxext/numpydoc.pyc
>> numpy-1.6.2rc1/doc/sphinxext/plot_directive.pyc
>>
>
> Good point ;)
>
> Chuck
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] spurious space in printing record arrays?

2012-05-10 Thread Benjamin Root
Just noticed this in the output from printing some numpy record arrays:

[[('2008081712', -24, -78.0, 20.10381469727, 45.0, -999.0, 0.0)]
 [ ('2008081718', -18, -79.584741211, 20.70762939453, 45.0, -999.0,
0.0)]
 [ ('2008081800', -12, -80.3305175781, 21.10381469727, 45.0,
-999.0, 0.0)]
 [ ('2008081806', -6, -80.8305175781, 21.89618530273, 45.0, -999.0,
0.0)]
 [ ('2008081812', 0, -81.1694824219, 23.20762939453, 50.0, -999.0,
1002.0)]]


[[ ('2008081812', 0, -81.1694824219, 23.20762939453, 50.0, -999.0,
0.0)]
 [('2008081815', 3, -81.5, 23.60381469727, 50.0, -999.0, 1003.0)]
 [ ('2008081900', 12, -81.8305175781, 24.60381469727, 55.0, -999.0,
0.0)]
 [ ('2008081912', 24, -82.084741211, 26.20762939453, 65.0, -999.0,
0.0)]
 [('2008082000', 36, -82.0, 27.79237060547, 50.0, -999.0, 0.0)]
 [ ('2008082012', 48, -81.8305175781, 29.29237060547, 40.0, -999.0,
0.0)]
 [('2008082112', 72, -81.5, 31.5, 35.0, -999.0, 0.0)]
 [('2008082212', 96, -81.5, 33.58474121094, 25.0, -999.0, 0.0)]
 [('2008082312', 120, -82.5, 35.5, 20.0, -999.0, 0.0)]]

On my 80-character wide terminal window, each line that gets wrapped also
has an extra space after the inner square bracket.  Coincidence? Using
v1.6.1

I don't think it is a big problem... just odd.
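
For reference, a comparable (N, 1) record array can be built like this (the
dtype and field names are assumptions made to resemble the printed data; how
a given row wraps depends on the line width):

    import numpy as np

    dt = np.dtype([('date', 'S10'), ('hour', 'i4'), ('lon', 'f8'),
                   ('lat', 'f8'), ('wind', 'f8'), ('gust', 'f8'),
                   ('pres', 'f8')])
    a = np.array([[('2008081712', -24, -78.0, 20.10381469727, 45.0, -999.0, 0.0)],
                  [('2008081718', -18, -79.584741211, 20.70762939453, 45.0, -999.0, 0.0)]],
                 dtype=dt)
    print(a)                     # an (N, 1) array of records, one per row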

Thanks,
Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 1:23 PM, Chris Barker wrote:

> On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>  wrote:
>> What would serve me? I use NumPy as a glorified "double*".
> 
>> all I want is my glorified
>> "double*". I'm probably not a representative user.)
> 
> Actually, I think you are representative of a LOT of users -- it
> turns, out, whether Jim Huginin originally was thinking this way or
> not, but numpy arrays are really powerful because the provide BOTH and
> nifty, full featured array object in Python, AND a wrapper around a
> generic "double*" (actually char*, that could be any type).
> 
> This is are really widely used feature, and has become even more so
> with Cython's numpy support.
> 
> That is one of my concerns about the "bit pattern" idea -- we've then
> created a new binary type that no other standard software understands
> -- that looks like a a lot of work to me to deal with, or even worse,
> ripe for weird, non-obvious errors in code that access that good-old
> char*.

This needs to be clarified: the point of the "bit pattern" idea is that the
downstream user would have to actually *request* data in that format or they
would get an error. You would not get it by "accident". If you asked for
an array of floats you would get an array of floats (not an array of
NA-floats).

R has *already* created this binary type and we are just including the ability 
to understand it in NumPy. 
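
For illustration, here is a minimal sketch of what a bit-pattern NA means at
the binary level. The constant is an assumption chosen to resemble R's
NA_real_ (a NaN with a special payload), not an actual NumPy API; code that
only sees a plain double* would treat the value as an ordinary NaN:

    import numpy as np

    # Illustration only: a bit-pattern NA is a reserved value inside the
    # ordinary binary layout.  The constant below is meant to resemble R's
    # NA_real_; treat it as an assumption.
    R_NA_REAL = np.uint64(0x7FF00000000007A2)

    a = np.array([1.0, 2.0, 3.0])
    bits = a.view(np.uint64)     # same memory, reinterpreted as integers
    bits[1] = R_NA_REAL          # "assign NA" by writing the bit pattern

    print(a)                     # [ 1.  nan  3.] -- plain float code sees a NaN
    print(bits == R_NA_REAL)     # [False  True False] -- but NA is detectable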

This is why it is an easy thing to do without changing the structure of what a 
NumPy array *is*.   Adding the concept of a mask to *every* NumPy array (even 
NumPy arrays that are currently being used in the wild to represent masks) is 
the big change that I don't think should happen. 

-Travis

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Charles R Harris
On Thu, May 10, 2012 at 1:14 PM, Charles R Harris  wrote:

>
>
> On Thu, May 10, 2012 at 12:52 PM, Scott Ransom  wrote:
>
>> On 05/10/2012 02:23 PM, Chris Barker wrote:
>> > On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>> >   wrote:
>> >> What would serve me? I use NumPy as a glorified "double*".
>> >
>> >> all I want is my glorified
>> >> "double*". I'm probably not a representative user.)
>> >
>> > Actually, I think you are representative of a LOT of users -- it
>> > turns, out, whether Jim Huginin originally was thinking this way or
>> > not, but numpy arrays are really powerful because the provide BOTH and
>> > nifty, full featured array object in Python, AND a wrapper around a
>> > generic "double*" (actually char*, that could be any type).
>> >
>> > This is are really widely used feature, and has become even more so
>> > with Cython's numpy support.
>> >
>> > That is one of my concerns about the "bit pattern" idea -- we've then
>> > created a new binary type that no other standard software understands
>> > -- that looks like a a lot of work to me to deal with, or even worse,
>> > ripe for weird, non-obvious errors in code that access that good-old
>> > char*.
>> >
>> > So I'm happier with a mask implementation -- more memory, yes, but it
>> > seems more robust an easy to deal with with outside code.
>> >
>> > But either way, Dag's key point is right on -- in Cython (or any other
>> > code) -- we need to make sure ti's easy to get a regular old pointer
>> > to a regular old C array, and get something else by accident.
>> >
>> > -Chris
>>
>> Agreed.  (As someone who has been heavily using Numpy since the early
>> days of numeric, and who wrote and maintains a suite of scientific
>> software that uses Numpy and its C-API in exactly this way.)
>>
>> Note that I wasn't aware that the proposed mask implementation might (or
>> would?) change this behavior...  (and hopefully I haven't just
>> misinterpreted these last few emails.  If so, I apologize.).
>>
>>
> I haven't seen a change in this behavior, otherwise most of current numpy
> would break.
>
>
I suspect this rumour comes from some ideas for generator arrays (not
mine), but I would strongly oppose anything that changes things that much.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Charles R Harris
On Thu, May 10, 2012 at 12:52 PM, Scott Ransom  wrote:

> On 05/10/2012 02:23 PM, Chris Barker wrote:
> > On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
> >   wrote:
> >> What would serve me? I use NumPy as a glorified "double*".
> >
> >> all I want is my glorified
> >> "double*". I'm probably not a representative user.)
> >
> > Actually, I think you are representative of a LOT of users -- it
> > turns, out, whether Jim Huginin originally was thinking this way or
> > not, but numpy arrays are really powerful because the provide BOTH and
> > nifty, full featured array object in Python, AND a wrapper around a
> > generic "double*" (actually char*, that could be any type).
> >
> > This is are really widely used feature, and has become even more so
> > with Cython's numpy support.
> >
> > That is one of my concerns about the "bit pattern" idea -- we've then
> > created a new binary type that no other standard software understands
> > -- that looks like a a lot of work to me to deal with, or even worse,
> > ripe for weird, non-obvious errors in code that access that good-old
> > char*.
> >
> > So I'm happier with a mask implementation -- more memory, yes, but it
> > seems more robust an easy to deal with with outside code.
> >
> > But either way, Dag's key point is right on -- in Cython (or any other
> > code) -- we need to make sure ti's easy to get a regular old pointer
> > to a regular old C array, and get something else by accident.
> >
> > -Chris
>
> Agreed.  (As someone who has been heavily using Numpy since the early
> days of numeric, and who wrote and maintains a suite of scientific
> software that uses Numpy and its C-API in exactly this way.)
>
> Note that I wasn't aware that the proposed mask implementation might (or
> would?) change this behavior...  (and hopefully I haven't just
> misinterpreted these last few emails.  If so, I apologize.).
>
>
I haven't seen a change in this behavior, otherwise most of current numpy
would break.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Inati, Souheil (NIH/NIMH) [E]

On May 10, 2012, at 2:23 PM, Chris Barker wrote:

> On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>  wrote:
>> What would serve me? I use NumPy as a glorified "double*".
> 
>> all I want is my glorified
>> "double*". I'm probably not a representative user.)
> 
> Actually, I think you are representative of a LOT of users -- it
> turns, out, whether Jim Huginin originally was thinking this way or
> not, but numpy arrays are really powerful because the provide BOTH and
> nifty, full featured array object in Python, AND a wrapper around a
> generic "double*" (actually char*, that could be any type).
> 
> This is are really widely used feature, and has become even more so
> with Cython's numpy support.
> 
> That is one of my concerns about the "bit pattern" idea -- we've then
> created a new binary type that no other standard software understands
> -- that looks like a a lot of work to me to deal with, or even worse,
> ripe for weird, non-obvious errors in code that access that good-old
> char*.
> 
> So I'm happier with a mask implementation -- more memory, yes, but it
> seems more robust an easy to deal with with outside code.
> 
> But either way, Dag's key point is right on -- in Cython (or any other
> code) -- we need to make sure ti's easy to get a regular old pointer
> to a regular old C array, and get something else by accident.
> 
> -Chris
> 
> 

+1

As a physicist who uses numpy to develop MRI image reconstruction and data 
analysis methods, I really do think of numpy as a glorified double with a nice 
way to call useful numerical methods.  I also use external methods all the time 
and it's of the utmost importance to have a pointer to a block of data that I 
can say is N complex doubles or something.  Using a separate array for a mask 
is not a big deal.  At worst it's a factor of 2 in memory.  It forces me to pay 
attention to what I'm doing, and if I want to do an SVD on my data, I better 
keep track of what I'm doing myself.

I am not that old, but I'm old enough to remember when matlab was really just 
this - glorified double with a nice slicing/view interface and a thin wrapper 
around eispack and linpack.  (here is a great article by Cleve Moler from 2000: 
http://www.mathworks.com/company/newsletters/news_notes/clevescorner/winter2000.cleve.html).
  You used to read in some ints from a data file and they converted it to 
double and you knew that if you got numerical precision errors it was because 
your algorithm was wrong or you were inverting some nearly singular matrix or 
something, not because of overflow.  And they made a copy of the data every 
time you called a function.  It had serious limitations, but what it did just 
worked.  And then they started to get fancy and it took them a REALLY long time 
and a lot of versions and man hours to get that all sorted out, with lazy 
evaluations and classes and sparse arrays and all that.

I'm not saying what the developers of numpy should do about the masked array 
thing and I really can't comment on how other people use numpy.  I also don't 
really have much of a say about the technical implementations of the guts of 
numpy, but it's worth asking really simple questions like:  I want to do an SVD 
on a 2D array with some missing or masked data.  What should happen?  This 
seems like such a simple question, but really it is incredibly complicated, or 
rather, it's very hard for numpy which is a foundation framework type of code 
to guess what the user means.

Anyway, that's my point of view.  I'm really happy numpy exists and works as 
well as it does and I'm thankful that there are developers out there that can 
build something so useful.

Cheers,
Souheil

--
Souheil Inati, PhD
Staff Scientist
Functional MRI Facility
NIMH/NIH


> 
> 
> 
> 
> 
> -- 
> 
> Christopher Barker, Ph.D.
> Oceanographer
> 
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
> 
> chris.bar...@noaa.gov
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Scott Ransom
On 05/10/2012 02:23 PM, Chris Barker wrote:
> On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>   wrote:
>> What would serve me? I use NumPy as a glorified "double*".
>
>> all I want is my glorified
>> "double*". I'm probably not a representative user.)
>
> Actually, I think you are representative of a LOT of users -- it
> turns, out, whether Jim Huginin originally was thinking this way or
> not, but numpy arrays are really powerful because the provide BOTH and
> nifty, full featured array object in Python, AND a wrapper around a
> generic "double*" (actually char*, that could be any type).
>
> This is are really widely used feature, and has become even more so
> with Cython's numpy support.
>
> That is one of my concerns about the "bit pattern" idea -- we've then
> created a new binary type that no other standard software understands
> -- that looks like a a lot of work to me to deal with, or even worse,
> ripe for weird, non-obvious errors in code that access that good-old
> char*.
>
> So I'm happier with a mask implementation -- more memory, yes, but it
> seems more robust an easy to deal with with outside code.
>
> But either way, Dag's key point is right on -- in Cython (or any other
> code) -- we need to make sure ti's easy to get a regular old pointer
> to a regular old C array, and get something else by accident.
>
> -Chris

Agreed.  (As someone who has been heavily using Numpy since the early 
days of numeric, and who wrote and maintains a suite of scientific 
software that uses Numpy and its C-API in exactly this way.)

Note that I wasn't aware that the proposed mask implementation might (or 
would?) change this behavior...  (and hopefully I haven't just 
misinterpreted these last few emails.  If so, I apologize.).

Cheers,

Scott

-- 
Scott M. RansomAddress:  NRAO
Phone:  (434) 296-0320   520 Edgemont Rd.
email:  sran...@nrao.edu Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Chris Barker
On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
 wrote:
> What would serve me? I use NumPy as a glorified "double*".

> all I want is my glorified
> "double*". I'm probably not a representative user.)

Actually, I think you are representative of a LOT of users -- it
turns out that, whether or not Jim Hugunin was originally thinking
this way, numpy arrays are really powerful because they provide BOTH
a nifty, full-featured array object in Python, AND a wrapper around a
generic "double*" (actually a char* that could be any type).

This is a really widely used feature, and has become even more so
with Cython's numpy support.

That is one of my concerns about the "bit pattern" idea -- we've then
created a new binary type that no other standard software understands
-- that looks like a lot of work to me to deal with, or even worse,
ripe for weird, non-obvious errors in code that accesses that good-old
char*.

So I'm happier with a mask implementation -- more memory, yes, but it
seems more robust and easy to deal with from outside code.

But either way, Dag's key point is right on -- in Cython (or any other
code) we need to make sure it's easy to get a regular old pointer
to a regular old C array, and not get something else by accident.
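
For concreteness, one minimal way to get at that plain pointer from Python
today is via ctypes; this is just an illustration of existing APIs (the array
and names are an example), nothing new:

    import ctypes
    import numpy as np

    # Make sure we really have a plain, contiguous C array of doubles...
    a = np.ascontiguousarray(np.arange(10.0), dtype=np.float64)
    # ...and hand its address off as a regular old double*.
    ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
    print(ptr[0], ptr[9])        # 0.0 9.0 -- the same memory the ndarray wraps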

-Chris







-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] read line mixed with string and number?

2012-05-10 Thread Chao YUE
Dear all,

I have files which contain lines like this:

30516F5  Sep1985  1-Day RainTrace   0.23.2   Trace   0.0
0.00.00.00.20.0   Trace  29.20.00.00.0
0.01.8
30516F5  Sep1985  1-Day SnowTrace   0.00.00.0   14.8
10.1   Trace   0.00.00.00.00.00.00.00.0
Trace  Trace   0.0
30516F5  Sep1985  1-Day Pcpn.   Trace   0.23.2   Trace  18.9
9.8   Trace   0.00.20.0   Trace  29.20.00.00.0
Trace   1.80.0
30516F5  May1986  Max. Temp. Misg   Misg   Misg   Misg   Misg
Misg   9.08.08.00.06.01.01.0   -3.03.
30516F5  May1986  Min. Temp. Misg   Misg   Misg   Misg   Misg
Misg   Misg  -1.0   -2.0   -6.0   -5.0   -5.0   -3.0   -7.0   -6.0   -5.0
-3.0


The different columns are separated by blank spaces, with the first column as
the site name, the second as the month name, then the year, then the variable
name and the data.

I want to read the file line by line into a list, connect all the numerical
data within one year into a list, and then combine the data from different
years into one masked ndarray. In this process I check the flags (Trace,
Misg, etc.) and replace them with unique values (or missing values), and then
begin to analyse the data. Each file contains only one site; a file can be
big or small depending on the number of years.

I don't know what a good way to do this job is. What I am thinking of is to
read each file line by line, split each line on blank space, and replace the
special flags, but during this process I need to do type conversion.
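
For illustration, here is a minimal sketch of that line-by-line approach (the
flag-to-value choices, and the assumption that each record sits on one
already-unwrapped line, are mine and not part of the file format):

    import numpy.ma as ma

    def to_value(tok):
        """Map one token to (value, missing); sentinel choices are assumptions."""
        if tok == "Trace":
            return 0.05, False       # treat a trace as a small amount
        if tok == "Misg":
            return 0.0, True         # missing -> masked
        try:
            return float(tok), False
        except ValueError:
            return 0.0, True         # anything unexpected -> masked

    def is_value(tok):
        """True for numeric tokens and the special flags."""
        if tok in ("Trace", "Misg"):
            return True
        try:
            float(tok)
            return True
        except ValueError:
            return False

    def parse_record(line):
        """Split one record into site, month/year, variable name, masked data."""
        tokens = line.split()
        site, monthyear = tokens[0], tokens[1]
        i = 2
        while i < len(tokens) and not is_value(tokens[i]):
            i += 1                   # the variable name may contain spaces
        varname = " ".join(tokens[2:i])
        pairs = [to_value(t) for t in tokens[i:]]
        data = ma.array([v for v, m in pairs], mask=[m for v, m in pairs])
        return site, monthyear, varname, data

    print(parse_record("30516F5  Sep1985  1-Day Rain  Trace  0.2  3.2  Trace  0.0"))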

any suggestion would be appreciated.

Chao

-- 
***
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 11:38 AM, Dag Sverre Seljebotn wrote:
> On 05/10/2012 10:40 AM, Charles R Harris wrote:
>>
>>
>> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
>> mailto:d.s.seljeb...@astro.uio.no>>  wrote:
>>
>>  On 05/10/2012 06:18 AM, Charles R Harris wrote:
>>   >
>>   >
>>   >  On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
>>   >  mailto:d.s.seljeb...@astro.uio.no>
>>  >  >>  wrote:
>>   >
>>   >  Sorry everyone for being so dense and contaminating that
>>  other thread.
>>   >  Here's a new thread where I can respond to Nathaniel's response.
>>   >
>>   >  On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>>   >  >  Hi Dag,
>>   >  >
>>   >  >  On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>>   >  >  mailto:d.s.seljeb...@astro.uio.no>
>>  >>
>>   >wrote:
>>   >  >>  I'm a heavy user of masks, which are used to make data NA in the
>>   >  >>  statistical sense. The setting is that we have to mask out the
>>   >  radiation
>>   >  >>  coming from the Milky Way in full-sky images of the Cosmic
>>  Microwave
>>   >  >>  Background. There's data, but we know we can't trust it, so we
>>   >  make it
>>   >  >>  NA. But we also do play around with different masks.
>>   >  >
>>   >  >  Oh, this is great -- that means you're one of the users that I
>>  wasn't
>>   >  >  sure existed or not :-). Now I know!
>>   >  >
>>   >  >>  Today we keep the mask in a seperate array, and to zero-mask we 
>> do
>>   >  >>
>>   >  >>  masked_data = data * mask
>>   >  >>
>>   >  >>  or
>>   >  >>
>>   >  >>  masked_data = data.copy()
>>   >  >>  masked_data[mask == 0] = np.nan # soon np.NA
>>   >  >>
>>   >  >>  depending on the circumstances.
>>   >  >>
>>   >  >>  Honestly, API-wise, this is as good as its gets for us. Nice and
>>   >  >>  transparent, no new semantics to learn in the special case of
>>  masks.
>>   >  >>
>>   >  >>  Now, this has performance issues: Lots of memory use, extra
>>   >  transfers
>>   >  >>  over the memory bus.
>>   >  >
>>   >  >  Right -- this is a case where (in the NA-overview terminology)
>>  masked
>>   >  >  storage+NA semantics would be useful.
>>   >  >
>>   >  >>  BUT, NumPy has that problem all over the place, even for "x + y
>>   >  + z"!
>>   >  >>  Solving it in the special case of masks, by making a new API,
>>   >  seems a
>>   >  >>  bit myopic to me.
>>   >  >>
>>   >  >>  IMO, that's much better solved at the fundamental level. As an
>>   >  >>  *illustration*:
>>   >  >>
>>   >  >>  with np.lazy:
>>   >  >>   masked_data1 = data * mask1
>>   >  >>   masked_data2 = data * (mask1 | mask2)
>>   >  >>   masked_data3 = (x + y + z) * (mask1&   mask3)
>>   >  >>
>>   >  >>  This would create three "generator arrays" that would
>>  zero-mask the
>>   >  >>  arrays (and perform the three-term addition...) upon request.
>>   >  You could
>>   >  >>  slice the generator arrays as you wish, and by that slice the
>>   >  data and
>>   >  >>  the mask in one operation. Obviously this could handle
>>   >  NA-masking too.
>>   >  >>
>>   >  >>  You can probably do this today with Theano and numexpr, and I
>>  think
>>   >  >>  Travis mentioned that "generator arrays" are on his radar for 
>> core
>>   >  NumPy.
>>   >  >
>>   >  >  Implementing this today would require some black magic hacks,
>>  because
>>   >  >  on entry/exit to the context manager you'd have to "reach up"
>>   >  into the
>>   >  >  calling scope and replace all the ndarray's with LazyArrays and
>>  then
>>   >  >  vice-versa. This is actually totally possible:
>>   >  >  https://gist.github.com/2347382
>>   >  >  but I'm not sure I'd call it *wise*. (You could probably avoid 
>> the
>>   >  >  truly horrible set_globals_dict part of that gist, though.)
>>  Might be
>>   >  >  fun to prototype, though...
>>   >
>>   >  1) My main point was just that I believe masked arrays is
>>  something that
>>   >  to me feels immature, and that it is the kind of thing that
>>  should be
>>   >  constructed from simpler primitives. And that NumPy should
>>  focus on
>>   >  simple primitives. You could make it
>>   >
>>   >
>>   >  I can't disagree, as I suggested the same as a possibility myself ;)
>>   >  There is a lot of infrastructure now in numpy, but given the use
>>  cases
>>   >  I'm tending towards the view that masked arrays should be left to
>>  


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Nathaniel Smith
Hi Matthew,

On Thu, May 10, 2012 at 12:01 AM, Matthew Brett  wrote:
>> The third proposal is certainly the best one from Cython's perspective;
>> and I imagine for those writing C extensions against the C API too.
>> Having PyType_Check fail for ndmasked is a very good way of having code
>> fail that is not written to take masks into account.
>
> Mark, Nathaniel - can you comment how your chosen approaches would
> interact with extension code?
>
> I'm guessing the bitpattern dtypes would be expected to cause
> extension code to choke if the type is not supported?

That's pretty much how I'm imagining it, yes. Right now if you have,
say, a Cython function like

cdef f(np.ndarray[double] a):
...

and you do f(np.zeros(10, dtype=int)), then it will error out, because
that function doesn't know how to handle ints, only doubles. The same
would apply for, say, a NA-enabled integer. In general there are
almost arbitrarily many dtypes that could get passed into any function
(including user-defined ones, etc.), so C code already has to check
dtypes for correctness.

Second order issues:
- There is certainly C code out there that just assumes that it will
only be passed an array with certain dtype (and ndim, memory layout,
etc...). If you write such C code then it's your job to make sure that
you only pass it the kinds of arrays that it expects, just like now
:-).

- We may want to do some sort of special-casing of handling for
floating point NA dtypes that use an NaN as the "magic" bitpattern,
since many algorithms *will* work with these unchanged, and it might
be frustrating to have to wait for every extension module to be
updated just to allow for this case explicitly before using them. OTOH
you can easily work around this. Like say my_qr is a legacy C function
that will in fact propagate NaNs correctly, so float NA dtypes would
Just Work -- except, it errors out at the start because it doesn't
recognize the dtype. How annoying. We *could* have some special hack
you can use to force it to work anyway (by like making the "is this
the dtype I expect?" routine lie.) But you can also just do:

  def my_qr_wrapper(arr):
if arr.dtype is a NA float dtype with NaN magic value:
  result = my_qr(arr.view(arr.dtype.base_dtype))
  return result.view(arr.dtype)
else:
  return my_qr(arr)

and hey presto, now it will correctly pass through NAs. So perhaps
it's not worth bothering with special hacks.

- Of course if  your extension function does want to handle NAs
generically, then there will be a simple C api for checking for them,
setting them, etc. Numpy needs such an API internally anyway!

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 10:40 AM, Charles R Harris wrote:
>
>
> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
> mailto:d.s.seljeb...@astro.uio.no>> wrote:
>
> On 05/10/2012 06:18 AM, Charles R Harris wrote:
>  >
>  >
>  > On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
>  > mailto:d.s.seljeb...@astro.uio.no>
>  >> wrote:
>  >
>  > Sorry everyone for being so dense and contaminating that
> other thread.
>  > Here's a new thread where I can respond to Nathaniel's response.
>  >
>  > On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>  > > Hi Dag,
>  > >
>  > > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>  > > mailto:d.s.seljeb...@astro.uio.no>
> >>
>  >   wrote:
>  > >> I'm a heavy user of masks, which are used to make data NA in the
>  > >> statistical sense. The setting is that we have to mask out the
>  > radiation
>  > >> coming from the Milky Way in full-sky images of the Cosmic
> Microwave
>  > >> Background. There's data, but we know we can't trust it, so we
>  > make it
>  > >> NA. But we also do play around with different masks.
>  > >
>  > > Oh, this is great -- that means you're one of the users that I
> wasn't
>  > > sure existed or not :-). Now I know!
>  > >
>  > >> Today we keep the mask in a seperate array, and to zero-mask we do
>  > >>
>  > >> masked_data = data * mask
>  > >>
>  > >> or
>  > >>
>  > >> masked_data = data.copy()
>  > >> masked_data[mask == 0] = np.nan # soon np.NA
>  > >>
>  > >> depending on the circumstances.
>  > >>
>  > >> Honestly, API-wise, this is as good as its gets for us. Nice and
>  > >> transparent, no new semantics to learn in the special case of
> masks.
>  > >>
>  > >> Now, this has performance issues: Lots of memory use, extra
>  > transfers
>  > >> over the memory bus.
>  > >
>  > > Right -- this is a case where (in the NA-overview terminology)
> masked
>  > > storage+NA semantics would be useful.
>  > >
>  > >> BUT, NumPy has that problem all over the place, even for "x + y
>  > + z"!
>  > >> Solving it in the special case of masks, by making a new API,
>  > seems a
>  > >> bit myopic to me.
>  > >>
>  > >> IMO, that's much better solved at the fundamental level. As an
>  > >> *illustration*:
>  > >>
>  > >> with np.lazy:
>  > >>  masked_data1 = data * mask1
>  > >>  masked_data2 = data * (mask1 | mask2)
>  > >>  masked_data3 = (x + y + z) * (mask1&  mask3)
>  > >>
>  > >> This would create three "generator arrays" that would
> zero-mask the
>  > >> arrays (and perform the three-term addition...) upon request.
>  > You could
>  > >> slice the generator arrays as you wish, and by that slice the
>  > data and
>  > >> the mask in one operation. Obviously this could handle
>  > NA-masking too.
>  > >>
>  > >> You can probably do this today with Theano and numexpr, and I
> think
>  > >> Travis mentioned that "generator arrays" are on his radar for core
>  > NumPy.
>  > >
>  > > Implementing this today would require some black magic hacks,
> because
>  > > on entry/exit to the context manager you'd have to "reach up"
>  > into the
>  > > calling scope and replace all the ndarray's with LazyArrays and
> then
>  > > vice-versa. This is actually totally possible:
>  > > https://gist.github.com/2347382
>  > > but I'm not sure I'd call it *wise*. (You could probably avoid the
>  > > truly horrible set_globals_dict part of that gist, though.)
> Might be
>  > > fun to prototype, though...
>  >
>  > 1) My main point was just that I believe masked arrays is
> something that
>  > to me feels immature, and that it is the kind of thing that
> should be
>  > constructed from simpler primitives. And that NumPy should
> focus on
>  > simple primitives. You could make it
>  >
>  >
>  > I can't disagree, as I suggested the same as a possibility myself ;)
>  > There is a lot of infrastructure now in numpy, but given the use
> cases
>  > I'm tending towards the view that masked arrays should be left to
>  > others, at least for the time being. The question is how to
> generalize
>  > the infrastructure and what hooks to provide. I think just spending a
>  > month or two pulling stuff out is counter productive, but
> evolving the
>  > code is definitely needed. If you could familiarize yourself with
> what
>  > is in there, something that seems largely neglected by the
> critics, and
>   

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 06:05 AM, Dag Sverre Seljebotn wrote:
> On 05/10/2012 01:01 AM, Matthew Brett wrote:
>> Hi,
>>
>> On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
>>wrote:
>>> On 05/09/2012 06:46 PM, Travis Oliphant wrote:
 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate. I think they've
 done an amazing job at providing some context, articulating their views
 and suggesting ways forward in a mutually respectful manner. This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
 https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward. I'm
 also reading the document incorporating my understanding of the history,
 of NumPy as well as all of the users I've met and interacted with which
 means I have my own perspective that is not necessarily incorporated
 into that document but informs my recommendations. I'm not sure we can
 reach full consensus on this. We are also well past time for moving
 forward with a resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion
 can take place. I will make a plea that we keep this discussion as free
 from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
 we can. I can't guarantee that I personally will succeed at that, but I
 can tell you that I will try. That's all I'm asking of anyone else. I
 recognize that there are a lot of other issues at play here besides
 *just* the technical questions, but we are not going to resolve every
 community issue in this technical thread.

 We need concrete proposals and so I will start with three. Please feel
 free to comment on these proposals or add your own during the
 discussion. I will stop paying attention to this thread next Wednesday
 (May 16th) (or earlier if the thread dies) and hope that by that time we
 can agree on a way forward. If we don't have agreement, then I will move
 forward with what I think is the right approach. I will either write the
 code myself or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added
 to NumPy. We should work on these (int32, float64, complex64, str, bool)
 to start. So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's "masked ndarray objects" into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged. The
 array_interface keeps the masked array notions and the ufuncs keep the
 ability to handle arrays like ndmasked. Ideally, numpy.ma
    would be changed to use ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal. Feel free
 to comment on these proposals (or provide your own).

>>>
>>> Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!
>>
>> Yes, it is very well written, my compliments to the chefs.
>>
>>> The third proposal is certainly the best one from Cython's perspective;
>>> and I imagine for those writing C extensions against the C API too.
>>> Having PyType_Check fail for ndmasked is a very good way of having code
>>> fail that is not written to take masks into account.
>
> I want to make something more clear: There are two Cython cases; in the
> case of "cdef np.ndarray[double]" there is no problem as PEP 3118 access
> will raise an exception for masked arrays.
>
> But, there's the case where you do "cdef np.ndarray", and then proceed
> to use PyArray_DATA. Myself I do this more than PEP 3118 access; usually
> because I pass the data pointer to some C or C++ code.
>
> It'd be great to have such code be forward-compatible in the sense that
> it raises an exception when it meets a masked array. Having PyType_Check
> fail seems like the only way? Am I wrong?

I'm very sorry; I always meant PyObject_TypeCheck, not PyType_Check.

Dag

>
>
>> Mark, Nathaniel - can you comment how your chosen approaches would
>> interact with extension code?
>>
>> I'm guessing the bitpattern dtypes would be expected to cause
>> extension code to choke if the type is not supported?
>
> The proposal, as I understand it, is to use that with new dtypes (?). So
> things will often be fine for that reason:
>
> if arr.dtype == np.float32:
>   c_func

Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Charles R Harris
On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> On 05/10/2012 06:18 AM, Charles R Harris wrote:
> >
> >
> > On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
> > mailto:d.s.seljeb...@astro.uio.no>> wrote:
> >
> > Sorry everyone for being so dense and contaminating that other
> thread.
> > Here's a new thread where I can respond to Nathaniel's response.
> >
> > On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
> >  > Hi Dag,
> >  >
> >  > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
> >  > mailto:d.s.seljeb...@astro.uio.no>>
> >   wrote:
> >  >> I'm a heavy user of masks, which are used to make data NA in the
> >  >> statistical sense. The setting is that we have to mask out the
> > radiation
> >  >> coming from the Milky Way in full-sky images of the Cosmic
> Microwave
> >  >> Background. There's data, but we know we can't trust it, so we
> > make it
> >  >> NA. But we also do play around with different masks.
> >  >
> >  > Oh, this is great -- that means you're one of the users that I
> wasn't
> >  > sure existed or not :-). Now I know!
> >  >
> >  >> Today we keep the mask in a seperate array, and to zero-mask we
> do
> >  >>
> >  >> masked_data = data * mask
> >  >>
> >  >> or
> >  >>
> >  >> masked_data = data.copy()
> >  >> masked_data[mask == 0] = np.nan # soon np.NA
> >  >>
> >  >> depending on the circumstances.
> >  >>
> >  >> Honestly, API-wise, this is as good as its gets for us. Nice and
> >  >> transparent, no new semantics to learn in the special case of
> masks.
> >  >>
> >  >> Now, this has performance issues: Lots of memory use, extra
> > transfers
> >  >> over the memory bus.
> >  >
> >  > Right -- this is a case where (in the NA-overview terminology)
> masked
> >  > storage+NA semantics would be useful.
> >  >
> >  >> BUT, NumPy has that problem all over the place, even for "x + y
> > + z"!
> >  >> Solving it in the special case of masks, by making a new API,
> > seems a
> >  >> bit myopic to me.
> >  >>
> >  >> IMO, that's much better solved at the fundamental level. As an
> >  >> *illustration*:
> >  >>
> >  >> with np.lazy:
> >  >>  masked_data1 = data * mask1
> >  >>  masked_data2 = data * (mask1 | mask2)
> >  >>  masked_data3 = (x + y + z) * (mask1&  mask3)
> >  >>
> >  >> This would create three "generator arrays" that would zero-mask
> the
> >  >> arrays (and perform the three-term addition...) upon request.
> > You could
> >  >> slice the generator arrays as you wish, and by that slice the
> > data and
> >  >> the mask in one operation. Obviously this could handle
> > NA-masking too.
> >  >>
> >  >> You can probably do this today with Theano and numexpr, and I
> think
> >  >> Travis mentioned that "generator arrays" are on his radar for
> core
> > NumPy.
> >  >
> >  > Implementing this today would require some black magic hacks,
> because
> >  > on entry/exit to the context manager you'd have to "reach up"
> > into the
> >  > calling scope and replace all the ndarray's with LazyArrays and
> then
> >  > vice-versa. This is actually totally possible:
> >  > https://gist.github.com/2347382
> >  > but I'm not sure I'd call it *wise*. (You could probably avoid the
> >  > truly horrible set_globals_dict part of that gist, though.) Might
> be
> >  > fun to prototype, though...
> >
> > 1) My main point was just that I believe masked arrays is something
> that
> > to me feels immature, and that it is the kind of thing that should be
> > constructed from simpler primitives. And that NumPy should focus on
> > simple primitives. You could make it
> >
> >
> > I can't disagree, as I suggested the same as a possibility myself ;)
> > There is a lot of infrastructure now in numpy, but given the use cases
> > I'm tending towards the view that masked arrays should be left to
> > others, at least for the time being. The question is how to generalize
> > the infrastructure and what hooks to provide. I think just spending a
> > month or two pulling stuff out is counter productive, but evolving the
> > code is definitely needed. If you could familiarize yourself with what
> > is in there, something that seems largely neglected by the critics, and
> > make suggestions, that would be helpful.
>
> But how on earth can I make constructive criticisms about code when I
> don't know what the purpose of that code is supposed to be?
>

What do you mean? I thought the purpose was quite clearly laid out in the
NEP. But the implementation of that purpose required some infrastructure.
The point, I suppose, is for you to suggest what would serve your use case.


>
> Are you saying you agree that the masking aspect should be banned (or at
> l

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Scott Sinclair
On 9 May 2012 18:46, Travis Oliphant  wrote:
> The document is available here:
>    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

This is orthogonal to the discussion, but I'm curious as to why this
discussion document has landed in the website repo?

I suppose it's not a really big deal, but future uploads of the
website will now include a page at
http://numpy.scipy.org/NA-overview.html with the content of this
document. If that's desirable, I'll add a note at the top of the
overview referencing this discussion thread. If not it can be
relocated somewhere more desirable after this thread's discussion
deadline expires..

Cheers,
Scott
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Gael Varoquaux
On Wed, May 09, 2012 at 02:35:26PM -0500, Travis Oliphant wrote:
>  Basically it buys not forcing *all* NumPy users (on the C-API level) to
>now deal with a masked array.    I know this push is a feature that is
>part of Mark's intention (as it pushes downstream libraries to think about
>missing data at a fundamental level). 

I think that this is a bad policy because:

 1. An array is not always data. I realize that there is a big push for
data-related computing lately, but I still believe that the notion of
missing data makes no sense for the majority of numpy arrays that are
instantiated.

 2. Not every algorithm can be made to work with missing data. I would
even say that most advanced algorithms do not work with missing
data.

Don't try to force upon people a problem that they do not have :).

Gael

PS: This message does not claim to take any position in the debate on
which solution for missing data is the best, because I don't think that I
have a good technical vision to back any position.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 06:18 AM, Charles R Harris wrote:
>
>
> On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
> mailto:d.s.seljeb...@astro.uio.no>> wrote:
>
> Sorry everyone for being so dense and contaminating that other thread.
> Here's a new thread where I can respond to Nathaniel's response.
>
> On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>  > Hi Dag,
>  >
>  > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>  > mailto:d.s.seljeb...@astro.uio.no>>
>   wrote:
>  >> I'm a heavy user of masks, which are used to make data NA in the
>  >> statistical sense. The setting is that we have to mask out the radiation
>  >> coming from the Milky Way in full-sky images of the Cosmic Microwave
>  >> Background. There's data, but we know we can't trust it, so we make it
>  >> NA. But we also do play around with different masks.
>  >
>  > Oh, this is great -- that means you're one of the users that I wasn't
>  > sure existed or not :-). Now I know!
>  >
>  >> Today we keep the mask in a separate array, and to zero-mask we do
>  >>
>  >> masked_data = data * mask
>  >>
>  >> or
>  >>
>  >> masked_data = data.copy()
>  >> masked_data[mask == 0] = np.nan # soon np.NA
>  >>
>  >> depending on the circumstances.
>  >>
>  >> Honestly, API-wise, this is as good as it gets for us. Nice and
>  >> transparent, no new semantics to learn in the special case of masks.
>  >>
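For readers skimming the archive, here is a minimal, runnable sketch of the
two zero-masking idioms Dag describes above. The array contents are made up,
and np.nan stands in for the np.NA discussed in this thread:

    import numpy as np

    # Toy stand-ins for sky data and a 0/1 mask (1 = trusted, 0 = masked out).
    data = np.random.standard_normal((4, 4))
    mask = np.random.randint(0, 2, size=(4, 4))

    # Idiom 1: zero out the untrusted values by multiplication.
    masked_data = data * mask

    # Idiom 2: keep the shape, flag untrusted values as NaN
    # (np.nan standing in for the proposed np.NA).
    masked_data2 = data.copy()
    masked_data2[mask == 0] = np.nan

    # Downstream reductions must then be NaN-aware:
    print(np.nansum(masked_data2))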
>  >> Now, this has performance issues: Lots of memory use, extra transfers
>  >> over the memory bus.
>  >
>  > Right -- this is a case where (in the NA-overview terminology) masked
>  > storage+NA semantics would be useful.
>  >
>  >> BUT, NumPy has that problem all over the place, even for "x + y + z"!
>  >> Solving it in the special case of masks, by making a new API, seems a
>  >> bit myopic to me.
>  >>
>  >> IMO, that's much better solved at the fundamental level. As an
>  >> *illustration*:
>  >>
>  >> with np.lazy:
>  >>  masked_data1 = data * mask1
>  >>  masked_data2 = data * (mask1 | mask2)
>  >>  masked_data3 = (x + y + z) * (mask1 & mask3)
>  >>
>  >> This would create three "generator arrays" that would zero-mask the
>  >> arrays (and perform the three-term addition...) upon request. You could
>  >> slice the generator arrays as you wish, and by that slice the data and
>  >> the mask in one operation. Obviously this could handle NA-masking too.
>  >>
>  >> You can probably do this today with Theano and numexpr, and I think
>  >> Travis mentioned that "generator arrays" are on his radar for core NumPy.
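A hedged sketch of the numexpr route just mentioned: the whole masked
expression is compiled and evaluated in chunks, so intermediates such as
x + y or the combined mask are never materialised as full-size temporaries.
The array names follow Dag's illustration; the sizes are arbitrary:

    import numpy as np
    import numexpr as ne

    n = 10**6
    x, y, z = (np.random.standard_normal(n) for _ in range(3))
    mask1, mask3 = (np.random.randint(0, 2, n).astype(bool) for _ in range(2))

    # One fused pass over the operands; no full-size temporaries for
    # x + y, x + y + z, or mask1 & mask3.
    masked_data3 = ne.evaluate("where(mask1 & mask3, x + y + z, 0.0)")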
>  >
>  > Implementing this today would require some black magic hacks, because
>  > on entry/exit to the context manager you'd have to "reach up" into the
>  > calling scope and replace all the ndarray's with LazyArrays and then
>  > vice-versa. This is actually totally possible:
>  > https://gist.github.com/2347382
>  > but I'm not sure I'd call it *wise*. (You could probably avoid the
>  > truly horrible set_globals_dict part of that gist, though.) Might be
>  > fun to prototype, though...
>
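And a toy, pure-Python sketch, not an existing NumPy API and deliberately
ignoring the scope-swapping trick Nathaniel describes, of what a "generator
array" object could look like: the expression is only recorded at
construction time, and slicing evaluates data and mask together for just the
requested window. The class name LazyMasked is made up:

    import numpy as np

    class LazyMasked(object):
        """Records data * mask but defers evaluation until values are requested."""

        def __init__(self, data, mask):
            self.data = data
            self.mask = mask

        def __getitem__(self, index):
            # Slice data and mask together; only the requested window is
            # read and multiplied, so no full-size masked copy is allocated.
            return self.data[index] * self.mask[index]

        def __array__(self, dtype=None):
            # Allow np.asarray() to force full evaluation when needed.
            out = self.data * self.mask
            return out if dtype is None else out.astype(dtype)

    data = np.arange(12.0).reshape(3, 4)
    mask = np.array([[1, 1, 0, 1],
                     [0, 1, 1, 1],
                     [1, 0, 1, 0]])

    masked_data = LazyMasked(data, mask)
    print(masked_data[1:, :2])       # evaluates only a 2x2 window
    print(np.asarray(masked_data))   # full evaluation on demand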
> 1) My main point was just that I believe masked arrays are something that
> to me feels immature, and that it is the kind of thing that should be
> constructed from simpler primitives. And that NumPy should focus on
> simple primitives. You could make it
>
>
> I can't disagree, as I suggested the same as a possibility myself ;)
> There is a lot of infrastructure now in numpy, but given the use cases
> I'm tending towards the view that masked arrays should be left to
> others, at least for the time being. The question is how to generalize
> the infrastructure and what hooks to provide. I think just spending a
> month or two pulling stuff out is counterproductive, but evolving the
> code is definitely needed. If you could familiarize yourself with what
> is in there, something that seems largely neglected by the critics, and
> make suggestions, that would be helpful.

But how on earth can I make constructive criticisms about code when I 
don't know what the purpose of that code is supposed to be?

Are you saying you agree that the masking aspect should be banned (or at 
least not "core"), and asking me to look at code from that perspective 
and comment on how to get there while keeping as much as possible of the 
rest? Would that really be helpful?

Dag