Re: [Numpy-discussion] NA-mask interactions with existing C code

Dag Sverre Seljebotn Thu, 10 May 2012 22:45:36 -0700

On 05/11/2012 07:36 AM, Travis Oliphant wrote:
>>>
>>> I guess this mixture of Python-API and C-API is different from the way
>>> the API tries to protect against incorrect access. From the Python API, it
>>> should let everything through, because it's for Python code to use. From
>>> the C API, it should default to not letting things through, because
>>> special NA-mask aware code needs to be written. I'm not sure if there is
>>> a reasonable approach here which works for everything.
>>
>> Does that mean you consider changing ob_type for masked arrays
>> unreasonable? They can still use the same object struct...
>>
>>>
>>>     But in general, I will often be lazy and just do
>>>
>>>     def f(np.ndarray arr):
>>>          c_func(np.PyArray_DATA(arr))
>>>
>>>     It's an exception if you don't provide an array -- so who cares. (I
>>>     guess the odds of somebody feeding a masked array to code like that,
>>>     which doesn't try to be friendly, is relatively smaller though.)
>>>
>>>
>>> This code would already fail with non-contiguous strides or byte-swapped
>>> data, so the additional NA mask case seems to fit in an already-failing
>>> category.
>>
>> Honestly! I hope you didn't think I provided a full-fledged example?
>> Perhaps you'd like to point out to me that "c_func" is a bad name for a
>> function as well?
>>
>> One would of course check that things are contiguous (or pass on the
>> strides), check the dtype and dispatch to different C functions in each
>> case, etc.
>>
>> But that isn't the point. Scientific code most of the time does fall in
>> the "already-failing" category. That doesn't mean it doesn't count.
>> Let's focus on the number of code lines written and developer hours that
>> will be spent cleaning up the mess -- not the "validity" of the code in
>> question.
>>
>>>
>>>
>>>     If you know the datatype, you can really do
>>>
>>>     def f(np.ndarray[double] arr):
>>>          c_func(&arr[0])
>>>
>>>     which works with PEP 3118. But I use PyArray_DATA out of habit (and
>>>     since it works in the cases without dtype).
>>>
>>>     Frankly, I don't expect any Cython code to do the right thing here;
>>>     calling PyArray_FromAny is much more typing. And really, nobody ever
>>>     questioned that if we had an actual ndarray instance, we'd be allowed to
>>>     call PyArray_DATA.
>>>
>>>     I don't know how much Cython code is out there in the wild for which
>>>     this is a problem. Either way, it would cause something of a reeducation
>>>     challenge for Cython users.
>>>
>>>
>>> Since this style of coding already has known problems, do you think the
>>> case with NA-masks deserves more attention here? What will happen is
>>> access to array element data without consideration of the mask, which
>>> seems similar in nature to accessing array data with the wrong stride or
>>> byte order.
>>
>> I don't agree with the premise of that paragraph. There's no reason to
>> assume that just because code doesn't call FromAny, it has problems.
>> (And I'll continue to assume that whatever array is returned from
>> "np.ascontiguousarray" is really contiguous...)
>>
>> Whether it requires attention or not is a different issue though. I'm
>> not sure. I think other people should weigh in on that -- I mostly write
>> code for my own consumption.
>>
>> One should at least check pandas, scikits-image, scikits-learn, mpi4py,
>> petsc4py, and so on. And ask on the Cython users list. Hopefully it will
>> usually be PEP 3118. But now I need to turn in.
>>
>> Travis, would such a survey be likely to affect the outcome of your
>> decision in any way? Or should we just leave this for now?
>>
>
> This dialog gets at the heart of the matter, I think.   The NEP seems to want 
> NumPy to have a "better" API that always protects downstream users from 
> understanding what is actually under the covers.   It would prefer to push 
> NumPy in the direction of an array object that is fundamentally more opaque.  
>  However, the world NumPy lives in is decidedly not opaque.   There has been 
> significant education and shared understanding of what a NumPy array actually 
> *is* (a strided view of memory of a particular "dtype").   This shared 
> understanding has even been pushed into Python as the buffer protocol.    It 
> is very common for extension modules to go directly to the data they want by 
> using this understanding.
>
> This is very different from the traditional "shield your users from how 
> things are actually done" view of most object APIs.    It was actually 
> intentional.      I'm not saying that different choices could not have been 
> made or that some amount of shielding should never be contemplated.   I'm 
> just saying that NumPy has been used as a nice bridge between the world of 
> scientific computing codes that have chunks of memory allocated for 
> processing and high-level code.   Part of the reason for this bridge has been 
> the simple object model.
>
> I just don't think the NEP fully appreciates just how fundamental of a shift 
> this is in the wider NumPy community and it is not something that can be done 
> immediately or without careful attention.
>
> Dag is an *active* member in that larger group of C-consumers of NumPy 
> arrays.  As a long-time member of that group, myself, this is where my 
> concerns are coming from.   So far I am not hearing anything to alleviate 
> those concerns.
>
> See my post in the other thread for my proposal to add a flag that allows 
> users to switch between the Python side default being ndarray or ndmasked, 
> but they are different types at the C-level.    The proposal so far does not 
> specify whether or not ndarray or ndmasked is a subclass of the other.   
> Given the history of numpy.ma and the fact that it makes sense on the 
> C-level, I would lean toward ndmasked being a sub-class of ndarray --- thus a 
> C-user would have to do a PyArray_CheckExact to ensure they are getting a 
> base Python Array Object --- which they would have to do anyway because 
> numpy.ma arrays also pass PyArray_Check.


Making it a subclass means existing Cython code is not catered for, as 
PyObject_TypeCheck is used.

Is there an advantage for users in making it a subclass? Nobody is saying 
you couldn't 'inherit' the struct (make the ndmask struct be castable to 
a PyArrayObject*) even if that is not declared in the Python type object.
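
To make the distinction concrete, here is a minimal pure-Python sketch, not
NumPy code. All class, function and field names in it are hypothetical
stand-ins: isinstance plays the role of a PyObject_TypeCheck-style check,
and the ctypes structs only imitate the idea of a masked struct whose first
member is the plain array struct.

```python
import ctypes

# Hypothetical stand-ins for the types under discussion (NOT NumPy types).
class NDArray:
    pass

class MaskedSub(NDArray):       # option A: masked type subclasses the array
    pass

class MaskedSeparate:           # option B: masked type is unrelated in Python
    pass

def legacy_consumer(arr):
    # Mimics a PyObject_TypeCheck-style check: subclasses are accepted.
    if not isinstance(arr, NDArray):
        raise TypeError("expected NDArray")
    return "raw data read, mask ignored"

def accepts(consumer, obj):
    # True if the consumer lets the object through, False if it rejects it.
    try:
        consumer(obj)
        return True
    except TypeError:
        return False

# Option A slips through the existing check, so the mask is silently
# ignored; option B is rejected up front.
subclass_passes = accepts(legacy_consumer, MaskedSub())        # True
separate_passes = accepts(legacy_consumer, MaskedSeparate())   # False

# Sharing the struct layout does not require a declared subclass:
# the masked struct simply starts with the array struct.
class ArrayStruct(ctypes.Structure):
    _fields_ = [("data", ctypes.c_void_p), ("nd", ctypes.c_int)]

class MaskedStruct(ctypes.Structure):
    _fields_ = [("base", ArrayStruct), ("mask", ctypes.c_void_p)]

masked = MaskedStruct(ArrayStruct(None, 2), None)
# Reinterpret the masked struct as the plain array struct, as C code
# that knows about the shared prefix could do with a pointer cast.
as_array = ctypes.cast(ctypes.pointer(masked), ctypes.POINTER(ArrayStruct))
nd_via_cast = as_array.contents.nd                             # 2
```

With the subclass, code that only does the type check keeps running and
never looks at the mask; with a separate type it fails immediately, while
mask-aware C code can still reuse the common struct prefix.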

Dag
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
