Re: [Numpy-discussion] Behaviour of copy for structured dtypes with gaps

Allan Haldane Fri, 12 Apr 2019 09:13:49 -0700

I would be much more in favor of `copy` eliminating padding in the
dtype, if dtypes with different paddings were considered equivalent.
But they are not.


Numpy has always treated dtypes with different padding bytes as
not-equal, and prints them very differently:

    >>> a = np.array([1], dtype={'names': ['f'],
    ...                          'formats': ['i4'],
    ...                          'offsets': [0]})
    >>> b = np.array([1], dtype={'names': ['f'],
    ...                          'formats': ['i4'],
    ...                          'offsets': [4]})
    >>> a.dtype == b.dtype
    False
    >>> a.dtype
    dtype([('f', '<i4')])
    >>> b.dtype
    dtype({'names':['f'], 'formats':['<i4'], 'offsets':[4], 'itemsize':8})

That is unlike strides, which are hidden from the user.
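To make the contrast concrete, here is a minimal sketch assuming NumPy 1.16 or later. The three-field record and its field names are hypothetical, chosen only for illustration. It shows that `copy` normalizes strides but keeps the padded dtype, and that `numpy.lib.recfunctions.repack_fields` is one explicit way to request a gap-free layout.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical record with three fields; the names are illustrative only.
full = np.zeros(4, dtype=[('a', 'i4'), ('m', 'f8'), ('z', 'i4')])

# Multi-field indexing returns a view whose dtype keeps the gap left by 'm'.
view = full[['a', 'z']]

# copy() preserves the dtype as is, so the gap survives the copy.
copied = view.copy()

# repack_fields returns a copy whose dtype has no padding between fields.
packed = rfn.repack_fields(copied)

print(view.dtype.itemsize, copied.dtype.itemsize, packed.dtype.itemsize)

# Strides, by contrast, are silently compacted by copy().
strided = np.arange(100)[::10]
print(strided.strides, strided.copy().strides)
```

The padded and packed results hold the same field values but compare as different dtypes, which is the asymmetry with strides described above.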

If we do a "dtype-overhaul" as has been plentifully discussed before,
there are many things we might change about structured dtypes, and
making padding be irrelevant in most operations could be a good one.

On the other hand, one of the main purposes of structured arrays appears
to be interpreting binary blobs and interfacing with C code that uses
C structs, where padding can be very important. E.g., if someone is
reading a binary file, they might want to do

    >>> np.fromfile('myfile', a.dtype, count=10)

and then it matters very greatly to them whether the dtype has padding
or not.
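As a rough illustration of why this matters, the sketch below writes ten 8-byte records (4 padding bytes followed by a little-endian int32) to a temporary file and reads them back with both dtypes. The record layout and the temporary file are assumptions made for the example, not something taken from a real file format.

```python
import os
import struct
import tempfile

import numpy as np

# Same two dtypes as in the session above: one packed, one with 4 bytes of
# leading padding (NumPy infers an itemsize of 8 for the padded one).
packed = np.dtype({'names': ['f'], 'formats': ['<i4'], 'offsets': [0]})
padded = np.dtype({'names': ['f'], 'formats': ['<i4'], 'offsets': [4]})

# Hypothetical on-disk layout: 4 zero pad bytes, then a little-endian int32.
blob = b''.join(struct.pack('<4xi', k) for k in range(10))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'myfile')
    with open(path, 'wb') as fh:
        fh.write(blob)

    # The padded dtype matches the file layout and recovers the values.
    right = np.fromfile(path, dtype=padded, count=10)

    # The packed dtype reads 4-byte records, so padding is taken as data.
    wrong = np.fromfile(path, dtype=packed, count=10)

print(right['f'])
print(wrong['f'])
```

If `copy` silently dropped the padding, a dtype taken from a copied array would behave like `packed` here and misread the file.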

Best,
Allan


PS. It is unfinished, but I would like to advertise an 'ArrayCollection'
ndarray ducktype I have worked a bit on. This ducktype behaves very much
like structured arrays for indexing and assignment, but avoids all these
padding issues and in other ways is more suitable for "pandas-like"
usage than structured arrays. See the "ArrayCollection" and
"MaskedArrayCollection" classes at
https://github.com/ahaldane/ndarray_ducktypes
See the tests and doc folders for some brief example usage.



On 4/11/19 10:07 PM, Nathaniel Smith wrote:
> My concern would be that to implement (2), I think .copy() has to
> either special-case certain dtypes, or else we have to add some kind
> of "simplify for copy" operation to the dtype protocol. These both add
> architectural complexity, so maybe it's better to avoid it unless we
> have a compelling reason?
> 
> On Thu, Apr 11, 2019 at 6:51 AM Marten van Kerkwijk
> <m.h.vankerkw...@gmail.com> wrote:
>>
>> Hi All,
>>
>> An issue [1] about the copying of arrays with structured dtype raised a 
>> question about what the expected behaviour is: does copy always preserve the 
>> dtype as is, or should it remove padding?
>>
>> Specifically, consider an array with a structure with many fields, say 'a' 
>> to 'z'. Since numpy 1.16, if one does `a[['a', 'z']]`, a view will be 
>> returned. In this case, its dtype will include a large offset. Now, if we 
>> copy this view, should the result have exactly the same dtype, including the 
>> large offset (i.e., the copy takes as much memory as the original full 
>> array), or should the padding be removed? From the discussion so far, it 
>> seems the logic has boiled down to a choice between:
>>
>> (1) Copy is a contract that the dtype will not vary (e.g., we also do not 
>> change endianness);
>>
>> (2) Copy is a contract that any access to the data in the array will return 
>> exactly the same result, without wasting memory and possibly optimized for 
>> access with different strides. E.g., `array[::10].copy()` also compacts the 
>> result.
>>
>> An argument in favour of (2) is that, before numpy 1.16, `a[['a', 
>> 'z']].copy()` did return an array without padding. Of course, this relied on 
>> `a[['a', 'z']]` already returning a copy without padding, but still this is 
>> a regression.
>>
>> More generally, there should at least be a clear way to get the compact 
>> copy. Also, it would make sense for things like `np.save` to remove any 
>> padding (it currently does not).
>>
>> What do people think? All the best,
>>
>> Marten
>>
>> [1] https://github.com/numpy/numpy/issues/13299
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 
> 
