Re: [Numpy-discussion] NEP 37: A dispatch protocol for NumPy-like modules

2020-02-28 Thread Allan Haldane
On 2/23/20 6:59 PM, Ralf Gommers wrote:
> One of the main rationales for the whole NEP, and the argument in
> multiple places
> (https://numpy.org/neps/nep-0037-array-module.html#opt-in-vs-opt-out-for-users)
> is that it's now opt-in while __array_function__ was opt-out. This isn't
> really true - the problem is simply *moved*, from the duck array
> libraries to the array-consuming libraries. The end user will still see
> the backwards incompatible change, with no way to turn it off. It will
> be easier with __array_module__ to warn users, but this should be
> expanded on in the NEP.

Might it be possible to flip this NEP back to opt-out while keeping the
nice simplifications and configurable array-creation routines it offers
relative to __array_function__?

That is, what if we defined two modules, "numpy" and "numpy_strict"?
"numpy_strict" would raise an exception on duck-arrays defining
__array_module__ (as numpy currently does). "numpy" would be a wrapper
around "numpy_strict" that decorates all numpy functions with a call to
"get_array_module(inputs).func(inputs)".

Then end-user code that did "import numpy as np" would accept ducktypes
by default, while library developers who want to signal they don't
support ducktypes could opt out by doing "import numpy_strict as np".
Issues with `np.asarray` seem mitigated compared to __array_function__
since that function would now be ducktype-aware.
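
For concreteness, a rough sketch of what the wrapper module could look
like (the `numpy_strict` module and the `get_array_module` helper are
the hypothetical pieces from this proposal and NEP 37, not existing
APIs):

    # hypothetical "numpy" wrapper module dispatching via get_array_module
    import functools
    import numpy_strict as _strict   # assumed strict module

    def _dispatched(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            module = _strict.get_array_module(*args)
            return getattr(module, func.__name__)(*args, **kwargs)
        return wrapper

    concatenate = _dispatched(_strict.concatenate)
    sum = _dispatched(_strict.sum)
    # ... and so on for the rest of the numpy API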

Cheers,
-Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] New DTypes: Are scalars a central concept in NumPy or not?

2020-02-24 Thread Allan Haldane
I have some thoughts on scalars from playing with ndarray ducktypes
(__array_function__), eg a MaskedArray ndarray-ducktype, for which I
wanted an associated "MaskedScalar" type.

In summary, the way scalars currently work makes ducktyping
(duck-scalars) difficult:

  * numpy scalar types are not subclassable, so my duck-scalars aren't
subclasses of numpy scalars and aren't in the type hierarchy
  * even if scalars were subclassable, I would have to subclass each
scalar datatype individually to make masked versions
  * lots of code checks `isinstance(var, np.float64)`, which breaks
for my duck-scalars
  * it was difficult to distinguish between a duck-scalar and a duck-0d
array. The method I used in the end seems hacky.

This has led to some daydreams about how scalars should work, and also
led me to read through your NEPs 40/41 with a specific focus on what
you said about scalars; I was about to post there until I saw this
discussion. I agree with what you said in the NEPs about not making
scalars be dtype instances.

Here is what ducktypes led me to:

If we are able to do something like define a `np.numpy_scalar` type
covering all numpy scalars, which has a `.dtype` attribute like you
describe in the NEPs, then that would seem to solve the ducktype
problems above. Ducktype implementors would need to make a "duck-scalar"
type in parallel to their "duck-ndarray" type, but I found that to be
pretty easy using an abstract class in my MaskedArray ducktype, since
the MaskedArray and MaskedScalar share a lot of behavior.
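
As a very rough illustration of the kind of thing I mean by
`np.numpy_scalar` (nothing like this exists in numpy today; the names
are made up):

    import numpy as np

    class numpy_scalar:
        """One generic scalar wrapper type covering all dtypes."""
        def __init__(self, value, dtype):
            self._value = value
            self.dtype = np.dtype(dtype)   # dtype is an attribute, not
                                           # encoded in type(self)
        def __repr__(self):
            return "numpy_scalar({!r}, dtype={})".format(self._value,
                                                         self.dtype)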

A numpy_scalar type would also help solve some object-array problems if
the object scalars are wrapped in the numpy_scalar type. A long time ago I
started to try to fix up various funny/strange behaviors of object
datatypes, but there are lots of special cases, and the main problem was
that the returned objects (eg from indexing) were not numpy types and
did not support numpy attributes or indexing. Wrapping the returned
object in `np.numpy_scalar` might add an extra slight annoyance to
people who want to unwrap the object, but I think it would make object
arrays less buggy and make code using object arrays easier to reason
about and debug.
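
For example, today the object returned by indexing an object array is
the bare python object, with no numpy attributes to lean on:

    import numpy as np

    o = np.empty(2, dtype=object)
    o[0] = {'a': 1}
    o[1] = [1, 2, 3]
    x = o[0]
    print(type(x))              # <class 'dict'>
    print(hasattr(x, 'dtype'))  # False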

Finally, a few random votes/comments based on the other emails on the list:

I think scalars have a place in numpy (rather than just reusing 0d
arrays), since there is a clear use in having hashable, immutable
scalars. Structured scalars should probably be immutable.

I agree with your suggestion that scalars should not be indexable. Thus,
my duck-scalars (and proposed numpy_scalar) would not be indexable.
However, I think they should encode their datatype through a .dtype
attribute like ndarrays, rather than by inheritance.

Also, something to think about is that currently numpy scalars satisfy
the property `isinstance(np.float64(1), float)`, i.e. they are within the
python numerical type hierarchy, while 0d arrays do not have this
property. My proposal above would break this. I'm not sure whether this
is a good property to maintain or not.
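
For reference, the property in question:

    import numpy as np

    print(isinstance(np.float64(1), float))   # True  (scalar is a python float)
    print(isinstance(np.array(1.0), float))   # False (0d array is not)
    print(np.isscalar(np.float64(1)))         # True
    print(np.isscalar(np.array(1.0)))         # False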

Cheers,
Allan



On 2/21/20 8:37 PM, Sebastian Berg wrote:
> Hi all,
> 
> When we create new datatypes, we have the option to make new choices
> for the new datatypes [0] (not the existing ones).
> 
> The question is: Should every NumPy datatype have a scalar associated
> and should operations like indexing return a scalar or a 0-D array?
> 
> This is in my opinion a complex, almost philosophical, question, and we
> do not have to settle anything for a long time. But, if we do not
> decide a direction before we have many new datatypes the decision will
> make itself...
> So happy about any ideas, even if its just a gut feeling :).
> 
> There are various points. I would like to mostly ignore the technical
> ones, but I am listing them anyway here:
> 
>   * Scalars are faster (although that can be optimized likely)
> 
>   * Scalars have a lower memory footprint
> 
>   * The current implementation incurs a technical debt in NumPy.
> (I do not think that is a general issue, though. We could
> automatically create scalars for each new datatype probably.)
> 
> Advantages of having no scalars:
> 
>   * No need to keep track of scalars to preserve them in ufuncs, or
> libraries using `np.asarray`, do they need `np.asarray_or_scalar`?
> (or decide they return always arrays, although ufuncs may not)
> 
>   * Seems simpler in many ways, you always know the output will be an
> array if it has to do with NumPy.
> 
> Advantages of having scalars:
> 
>   * Scalars are immutable and we are used to them from Python.
> A 0-D array cannot be used as a dictionary key consistently [1].
> 
> I.e. without scalars as first class citizen `dict[arr1d[0]]`
> cannot work, `dict[arr1d[0].item()]` may (if `.item()` is defined,
> and e.g. `dict[arr1d[0].frozen()]` could make a copy to work. [2]
> 
>   * Object arrays as we have them now make sense, `arr1d[0]` can
> 

Re: [Numpy-discussion] argmax() indexes to value

2019-11-01 Thread Allan Haldane
my thought was to try `take` or `take_along_axis`:

   ind = np.argmin(a, axis=1)
   np.take_along_axis(a, ind[:,None], axis=1)

But those functions tend to simply fall back to fancy indexing, and are
pretty slow. On my system plain fancy indexing is fastest:

>>> %timeit a[np.arange(N),ind]
1.58 µs ± 18.1 ns per loop
>>> %timeit np.take_along_axis(a, ind[:,None], axis=1)
6.49 µs ± 57.3 ns per loop
>>> %timeit np.min(a, axis=1)
9.51 µs ± 64.1 ns per loop

Probably `take_along_axis` was designed with uses like yours in mind,
but it is not very optimized.
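
For reference, a self-contained version of the comparison above (the
array shape is my assumption, and absolute timings will vary by
machine):

    import numpy as np

    N = 1000
    a = np.random.rand(N, 10)
    ind = np.argmin(a, axis=1)

    v1 = a[np.arange(N), ind]                               # plain fancy indexing
    v2 = np.take_along_axis(a, ind[:, None], axis=1)[:, 0]  # take_along_axis
    assert np.array_equal(v1, v2)
    assert np.array_equal(v1, np.min(a, axis=1))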

(I think numpy is lacking a category of efficient
indexing/search/reduction functions, like 'findfirst', 'groupby',
short-circuiting any_*/all_*/nonzero, the proposed oindex/vindex, and
better gufunc broadcasting. There is slow but gradual infrastructure
work heading towards these.)

Cheers,
Allan

On 10/30/19 11:31 PM, Daniele Nicolodi wrote:
> On 30/10/2019 19:10, Neal Becker wrote:
>> max(axis=1)?
> 
> Hi Neal,
> 
> I should have been more precise in stating the problem. Getting the
> values in the array for which I'm looking at the maxima is only one step
> in a more complex piece of code for which I need the indexes along the
> second axis of the array. I would like to avoid to have to iterate the
> array more than once.
> 
> Thank you!
> 
> Cheers,
> Dan
> 
> 
>> On Wed, Oct 30, 2019, 7:33 PM Daniele Nicolodi wrote:
>>
>> Hello,
>>
>> this is a very basic question, but I cannot find a satisfying answer.
>> Assume a is a 2D array and that I get the index of the maximum value
>> along the second dimension:
>>
>> i = a.argmax(axis=1)
>>
>> Is there a better way to get the value of the maximum array entries
>> along the second axis other than:
>>
>> v = a[np.arange(len(a)), i]
>>
>> ??
>>
>> Thank you.
>>
>> Cheers,
>> Daniele

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] round / set_printoptions discrepancy

2019-09-13 Thread Allan Haldane
On 9/13/19 9:26 AM, Eric Moore wrote:
> See the notes section here.  
> https://numpy.org/devdocs/reference/generated/numpy.around.html.
> 
> This note was recently added in https://github.com/numpy/numpy/pull/14392
> 
> Eric

Hmm, but this example with 16.055 shows the note still isn't quite right.

The doc suggests that the floating point error only matters for large
values or large `decimals`, but this shows it also happens for small
values. Makes sense now that I see the example.
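
For the record, the small-value case in question:

    import numpy as np

    print(format(16.055, '.20f'))   # 16.05499999999999971578
    print(round(16.055, 2))         # 16.05  (python rounds the true double value)
    print(np.round(16.055, 2))      # 16.06  (scale by 100, round, scale back)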

We should tweak the docstring.

Also, I did make some notes in
https://github.com/numpy/numpy/issues/14391
for how we could "fix" this problem efficiently. Unfortunately it's far
from trivial to write a correct rounding algorithm, and I'm not sure
it's worth the effort: The round error is comparable to normal
floating-point error, and I don't think round is heavily used.

Best,
Allan


> On Fri, Sep 13, 2019 at 9:20 AM Andras Deak wrote:
> 
> On Fri, Sep 13, 2019 at 2:59 PM Philip Hodge wrote:
> >
> > On 9/13/19 8:45 AM, Irvin Probst wrote:
> > > On 13/09/2019 14:05, Philip Hodge wrote:
> > >>
> > >> Isn't that just for consistency with Python 3 round()?  I agree
> that
> > >> the discrepancy with np.set_printoptions is not necessarily
> expected,
> > >> except possibly for backwards compatibility.
> > >>
> > >>
> > >
> > > I've just checked and np.set_printoptions behaves as python's round:
> > >
> > > >>> round(16.055,2)
> > > 16.05
> > > >>> np.round(16.055,2)
> > > 16.06
> > >
> > > I don't know why round and np.round do not behave the same,
> actually I
> > > would even dare to say that I don't care :-)
> > > However np.round and np.set_printoptions should provide the same
> > > output, shouldn't they ? This discrepancy is really disturbing
> whereas
> > > consistency with python's round looks like the icing on the cake but
> > > in no way a required feature.
> > >
> >
> > Python round() is supposed to round to the nearest even value, if the
> > two closest values are equally close.  So round(16.055, 2) returning
> > 16.05 was a surprise to me.  I checked the documentation and found a
> > note that explained that this was because "most decimal fractions
> can't
> > be represented exactly as a float."  round(16.55) returns 16.6.
> >
> > Phil
> >
> 
> Ah, of course, endless double-precision shenanigans...
> >>> format(16.055, '.30f')
> '16.054999999999999715782905695960'
> 
> >>> format(16.55, '.30f')
> '16.550000000000000710542735760100'
> 
> András

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new MaskedArray class

2019-06-24 Thread Allan Haldane
On 6/24/19 3:09 PM, Marten van Kerkwijk wrote:
> Hi Allan,
> 
> Thanks for bringing up the noclobber explicitly (and Stephan for asking
> for clarification; I was similarly confused).
> 
> It does clarify the difference in mental picture. In mine, the operation
> would indeed be guaranteed to be done on the underlying data, without
> copy and without `.filled(...)`. I should clarify further that I use
> `where` only to skip reading elements (i.e., in reductions), not writing
> elements (as you mention, the unwritten element will often be nonsense -
> e.g., wrong units - which to me is worse than infinity or something
> similar; I've not worried at all about runtime warnings). Note that my
> main reason here is not that I'm against filling with numbers for
> numerical arrays, but rather wanting to make minimal assumptions about
> the underlying data itself. This may well be a mistake (but I want to
> find out where it breaks).
> 
> Anyway, it would seem in many ways all the better that our models are
> quite different. I definitely see the advantages of your choice to
> decide one can do with masked data elements whatever is logical ahead of
> an operation!
> 
> Thanks also for bringing up a useful example with `np.dot(m, m)` -
> clearly, I didn't yet get beyond overriding ufuncs!
> 
> In my mental model, where I'd apply `np.dot` on the data and the mask
> separately, the result will be wrong, so the mask has to be set (which
> it would be). For your specific example, that might not be the best
> solution, but when using `np.dot(matrix_shaped, matrix_shaped)`, I think
> it does give the correct masking: any masked element in a matrix better
> propagate to all parts that it influences, even if there is a reduction
> of sorts happening. So, perhaps a price to pay for a function that tries
> to do multiple things.
> 
> The alternative solution in my model would be to replace `np.dot` with a
> masked-specific implementation of what `np.dot` is supposed to stand for
> (in your simple example, `np.add.reduce(np.multiply(m, m))` - more
> generally, add relevant `outer` and `axes`). This would be similar to
> what I think all implementations do for `.mean()` - we cannot calculate
> that from the data using any fill value or skipping, so rather use a
> more easily cared-for `.sum()` and divide by a suitable number of
> elements. But in both examples the disadvantage is that we took away the
> option to use the underlying class's `.dot()` or `.mean()` implementations.

Just to note, my current implementation uses the IGNORE style of mask,
so does not propagate the mask in np.dot:

>>> a = MaskedArray([[1,1,1], [1,X,1], [1,1,1]])
>>> np.dot(a, a)
MaskedArray([[3, 2, 3],
             [2, 2, 2],
             [3, 2, 3]])

I'm not at all set on that behavior and we can do something else. For
now, I chose this way since it seemed to best match the "IGNORE" mask
behavior.
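
Concretely, the IGNORE behavior shown above gives the same numbers you
would get from the filled data, with masked elements acting as the
additive identity (plain-numpy sketch; the MaskedArray/X names above
are from my ndarray_ducktypes repo):

    import numpy as np

    data = np.array([[1, 1, 1],
                     [1, 0, 1],    # the masked slot, filled with 0
                     [1, 1, 1]])
    print(np.dot(data, data))
    # [[3 2 3]
    #  [2 2 2]
    #  [3 2 3]]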

The behavior you described further above where the output row/col would
be masked corresponds better to "NA" (propagating) mask behavior, which
I am leaving for later implementation.

best,
Allan

> 
> (Aside: considerations such as these underlie my longed-for exposure of
> standard implementations of functions in terms of basic ufunc calls.)
> 
> Another example of a function for which I think my model is not
> particularly insightful (and for which it is difficult to know what to
> do generally) is `np.fft.fft`. Since an fft is equivalent to a
> sine/cosine fits to data points, the answer for masked data is in
> principle quite well-defined. But much less easy to implement!
> 
> All the best,
> 
> Marten
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new MaskedArray class

2019-06-24 Thread Allan Haldane
On 6/24/19 12:16 PM, Stephan Hoyer wrote:
> On Mon, Jun 24, 2019 at 8:46 AM Allan Haldane wrote:
> 
>  1. Making a "no-clobber" guarantee on the underlying data
> 
> 
> Hi Allan -- could kindly clarify what you mean by "no-clobber"?
> 
> Is this referring to allowing masked arrays to mutate masked data values
> in-place, even on apparently non-in-place operators? If so, that
> definitely seems like a bad idea to me. I would much rather do an
> unnecessary copy than have surprisingly non-thread-safe operations.

Yes. In my current implementation, the operation:

 >>> a = np.arange(6)
 >>> m = MaskedArray(a, mask=a < 3)
 >>> res = np.dot(m, m)

will clobber elements of a. It appears that to avoid clobbering we will
need to have np.dot make a copy. I also discuss how my implementation
clobbers views in the docs:

https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md#views-and-copies

I expect I could be convinced to make a no-clobber guarantee, if others
agree it is better to accept the performance loss by making a copy
internally.

I just still have a hard time thinking of cases where clobbering is
really that confusing, or where it is not easily avoided by the user
making an explicit copy. I like giving the user control over whether a
copy is made or not, since I expect in the vast majority of cases a copy
is unnecessary.

I think it will be rare for people to hold on to the data array ("a" in
the example above). Most of the time you create the MaskedArray on data
created on the spot which you never touch directly again. We are all
already used to numpy's "view" behavior (eg, for the np.array function),
where if you don't explicitly make a copy of your original array you can
expect further operations to modify it. Admittedly for MaskedArray it's
a bit different since apparently-readonly operations like np.dot can
clobber, but again, it doesn't seem hard to know about or burdensome to
avoid by an explicit copy, and it can give big performance advantages.
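
To be explicit, under the current implementation a user who needs the
original data untouched can make the copy themselves (sketch, using the
MaskedArray from my ndarray_ducktypes repo):

    import numpy as np
    from MaskedArray import MaskedArray   # from my ndarray_ducktypes repo

    a = np.arange(6)
    m = MaskedArray(a.copy(), mask=a < 3)
    res = np.dot(m, m)   # any clobbering now only touches the internal copy
    assert np.array_equal(a, np.arange(6))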

> 
>  If we agree that masked values will contain nonsense, it seems like a
> bad idea for those values to be easily exposed.
> 
> Further, in all the comments so far I have not seen an example of a need
> for unmasking that is not more easily, efficiently and safely achieved
> by simply creating a new MaskedArray with a different mask.
> 
> 
> My understanding is that essentially every low-level MaskedArray
> function is implemented by looking at the data and mask separately. If
> so, we should definitely expose this API directly to users (as part of
> the public API for MaskedArray), so they can write their own efficient
> algorithms.
>
> As a concrete example, suppose I wanted to implement a low-level
> "grouped mean" operation for masked arrays like that found in pandas.
> This isn't a builtin NumPy function, so I would need to write this
> myself. This would be relatively straightforward to do in Numba or
> Cython with raw NumPy arrays (e.g., see my example here for a NaN
> skipping
> version: https://github.com/shoyer/numbagg/blob/v0.1.0/numbagg/grouped.py),
> but to do it efficiently you definitely don't want to make an
> unnecessary copy.
> 
> The usual reason for hiding implementation details is when we want to
> reserve the right to change them. But if we're sure about the data model
> (which I think we are for MaskedArray) then I think there's a lot of
> value in exposing it directly to users, even if it's lower level than it
> appropriate to use in most cases.

Fair enough, I think it is all right to allow people access to ._data
and make some guarantees about it if they are implementing subclasses or
defining new ducktypes.

There should be a section in the documentation describing what
guarantees we make about ._data (or ._array if we change the name) and
how/when to use it.

Best,
Allan



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new MaskedArray class

2019-06-24 Thread Allan Haldane
On 6/24/19 11:46 AM, Allan Haldane wrote:
> A no-clobber guarantee makes your "iterative mask" example solvable in
> an efficient (no-copy) way:
> 
> mask = last_mask = False
> while True:
> dat_mean = np.mean(MaskedArray(data, mask))
> mask, last_mask = np.abs(data - mask) > cutoff, mask
> if np.all(mask == last_mask):
> break

Whoops, that should read "np.abs(data - dat_mean)" in there.
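
For reference, here is the corrected loop in runnable form, with the
mask initialized as a boolean array (the data and cutoff are made up
for illustration; MaskedArray is from my ndarray_ducktypes repo and
np.mean dispatches to it via __array_function__):

    import numpy as np
    from MaskedArray import MaskedArray   # from my ndarray_ducktypes repo

    data = np.random.randn(100)
    cutoff = 2.0

    mask = last_mask = np.zeros(data.shape, dtype=bool)
    while True:
        dat_mean = np.mean(MaskedArray(data, mask))
        mask, last_mask = np.abs(data - dat_mean) > cutoff, mask
        if np.all(mask == last_mask):
            break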

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new MaskedArray class

2019-06-24 Thread Allan Haldane
On 6/23/19 6:58 PM, Eric Wieser wrote:
> I think we’d need to consider separately the operation on the mask
> and on the data. In my proposal, the data would always do
> |np.sum(array, where=~mask)|, while how the mask would propagate
> might depend on the mask itself,
> 
> I quite like this idea, and I think Stephan’s strawman design is
> actually plausible, where |MaskedArray.mask| is either an |InvalidMask|
> or a |IgnoreMask| instance to pick between the different propagation
> types. Both classes could simply have an underlying |._array| attribute
> pointing to a duck-array of some kind that backs their boolean data.
> 
> The second version requires that you /also/ know how Mask classes
> work, and how they implement +
> 
> I remain unconvinced that Mask classes should behave differently on
> different ufuncs. I don’t think |np.minimum(ignore_na, b)| is any
> different to |np.add(ignore_na, b)| - either both should produce |b|, or
> both should produce |ignore_na|. I would lean towards produxing
> |ignore_na|, and propagation behavior differing between “ignore” and
> “invalid” only for |reduce| / |accumulate| operations, where the concept
> of skipping an application is well-defined.
> 
> Some possible follow-up questions that having two distinct masked types
> raise:
> 
>   * what if I want my data to support both invalid and skip fields at
> the same time? sum([invalid, skip, 1]) == invalid
>   * is there a use case for more that these two types of mask?
> |invalid_due_to_reason_A|, |invalid_due_to_reason_B| would be
> interesting things to track through a calculation, possibly a
> dictionary of named masks.
> 
> Eric

General comments on the last few emails:

For now I intentionally decided not to worry about NA masks in my
implementation. I want to get a first working implementation finished
for IGNORE style.

I agree it would be nice to have some support for NA style later, either
by a new MaskedArray subclass, a ducktype'd .mask attribute, or by some
other modification. In the latter category, consider that currently the
mask is stored as a boolean (1-byte) mask. One idea I have not put much
thought into is that internally we could make the mask a uint8, so that
unmasked would be 0, IGNORE would be 1, and NA would be 2. That would
allow mixing of mask types. I'm not sure how messy it would be to
implement.
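
Roughly, the idea would be something like this (constants and names are
purely illustrative):

    import numpy as np

    UNMASKED, IGNORE, NA = np.uint8(0), np.uint8(1), np.uint8(2)

    mask = np.array([UNMASKED, IGNORE, NA, UNMASKED], dtype=np.uint8)
    ignore = (mask == IGNORE)   # elements to skip, e.g. via where=~ignore
    na = (mask == NA)           # elements that propagate through operations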

For the idea of replacing the mask by a ducktype for NA style, my
instinct is that it would be tricky. Many of the MaskedArray
__array_function__ method implementations, like sort and dot and many
others, do "complicated" computations using the mask that I don't think
you could easily get to work right by simply substituting a mask
ducktype. I think those would need to be reimplemented for NA style, in
other words you would need to make a MaskedArray subclass anyway.

Cheers,
Allan


> On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer wrote:
> 
> On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk wrote:
> 
> Your proposal would be something like np.sum(array,
> where=np.ones_like(array))? This seems rather verbose for a
> common operation. Perhaps np.sum(array, where=True) would
> work, making use of broadcasting? (I haven't actually
> checked whether this is well-defined yet.)
> 
> I think we'd need to consider separately the operation on the
> mask and on the data. In my proposal, the data would always do
> `np.sum(array, where=~mask)`, while how the mask would propagate
> might depend on the mask itself, i.e., we'd have different mask
> types for `skipna=True` (default) and `False` ("contagious")
> reductions, which differed in doing `logical_and.reduce` or
> `logical_or.reduce` on the mask.
> 
> 
> OK, I think I finally understand what you're getting at. So suppose
> this this how we implement it internally. Would we really insist on
> a user creating a new MaskedArray with a new mask object, e.g., with
> a GreedyMask? We could add sugar for this, but certainly
> array.greedy_masked().sum() is significantly less clear than
> array.sum(skipna=False).
> 
> I'm also a little concerned about a proliferation of
> MaskedArray/Mask types. New types are significantly harder to
> understand than new functions (or new arguments on existing
> functions). I don't know if we have enough distinct use cases for
> this many types.
> 
> Are there use-cases for propagating masks separately from
> data? If not, it might make sense to only define mask
> operations along with data, which could be much simpler.
> 
> 
> I had only thought about separating out the concern of mask
> propagation from the "MaskedArray" class to the mask proper, but
> it might indeed make things easier if the mask 

Re: [Numpy-discussion] new MaskedArray class

2019-06-24 Thread Allan Haldane
On 6/22/19 11:50 AM, Marten van Kerkwijk wrote:
> Hi Allan,
> 
> I'm not sure I would go too much by what the old MaskedArray class did. 
> It indeed made an effort not to overwrite masked values with a new 
> result, even to the extend of copying back masked input data elements to 
> the output data array after an operation. But the fact that this is 
> non-sensical if the dtype changes (or the units in an operation on 
> quantities) suggests that this mental model simply does not work.
> 
> I think a sensible alternative mental model for the MaskedArray class is 
> that all it does is forward any operations to the data it holds and 
> separately propagate a mask, ORing elements together for binary 
> operations, etc., and explicitly skipping masked elements in reductions 
> (ideally using `where` to be as agnostic as possible about the 
> underlying data, for which, e.g., setting masked values to `0` for 
> `np.reduce.add` may or may not be the right thing to do - what if they 
> are string?)
>
> With this mental picture, the underlying data are always have 
> well-defined meaning: they have been operated on as if the mask did not 
> exist. There then is also less reason to try to avoid getting it back to 
> the user.
> 
> As a concrete example (maybe Ben has others): in astropy we have a 
> sigma-clipping average routine, which uses a `MaskedArray` to 
> iteratively mask items that are too far off from the mean; here, the 
> mask varies each iteration (an initially masked element can come back 
> into play), but the data do not.
> 
> All the best,
> 
> Marten

I want to distinguish 3 different behaviors we are discussing:

 1. Making a "no-clobber" guarantee on the underlying data
 2. whether the data under the mask should have "sensible" values
 3. whether to allow "unmasking"



1. For "no clobber"
===

I agree this would be a nice guarantee to make, as long as it does not
impose too much burden on us. Sometimes it is better not to chain
ourselves down.

By using "where", I can indeed make many api-functions and ufuncs
guarantee no-clobber. There are still a bunch of functions that seem
tricky to implement currently without either clobbering or making a
copy: dot, cross, outer, sort/argsort/searchsorted, correlate, convolve,
nonzero, and similar functions.

I'll think about how to implement those in a no-clobber way. Any
suggestions welcome, eg for np.dot/outer.

A no-clobber guarantee makes your "iterative mask" example solvable in
an efficient (no-copy) way:

mask = last_mask = False
while True:
dat_mean = np.mean(MaskedArray(data, mask))
mask, last_mask = np.abs(data - mask) > cutoff, mask
if np.all(mask == last_mask):
break

The MaskedArray constructor should have pretty minimal overhead.


2. Whether we can make "masked" data keep sensible values
=

I'm not confident this is a good guarantee to make. Certainly in a
simple case like c = a + b we can make masked values in c contain the
correct sum of the masked data in a and b. But I foresee complications
in other cases. For instance,

MaskedArray([4,4,4])/MaskedArray([0, 1, 2], mask=[1,0,1])

If we use the "where" ufunc argument to evaluate the division operation,
a division-by-0 warning is not output which is good. However, if we use
"where" index 2 does not get updated correctly and will contain
"nonsense". If we use "where" a lot (which I think we should) we can
expect a lot of uninitialized masked values to commonly appear.
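
To illustrate with plain numpy: `where=` skips the masked lanes (so no
divide-by-zero warning), but it also leaves the corresponding output
slots untouched:

    import numpy as np

    a = np.array([4., 4., 4.])
    b = np.array([0., 1., 2.])
    keep = np.array([False, True, False])   # ~mask from the example above
    out = np.empty(3)
    np.divide(a, b, where=keep, out=out)
    print(out)   # out[1] == 4.0; out[0] and out[2] hold whatever was in memory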

So my interpretation is that this comment:

> I think a sensible alternative mental model for the MaskedArray class
is that all it does is forward any operations to the data it holds and
separately propagate a mask

should only roughly be true, in the sense that we will not simply
"forward any operations" but we will also use "where" arguments which
produce nonsense masked values.


3. Whether to allow unmasking
=

If we agree that masked values will contain nonsense, it seems like a
bad idea for those values to be easily exposed.

Further, in all the comments so far I have not seen an example of a need
for unmasking that is not more easily, efficiently and safely achieved
by simply creating a new MaskedArray with a different mask.

If super-users want to access the ._data attribute they can, but I don't
think it should be recommended.

Normal users can use the ".filled" method, which by the way I
implemented to optionally support returning a readonly view rather than
a copy (the default).

Cheers,
Allan

> 
> On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane wrote:
> 
> On 6/21/19

Re: [Numpy-discussion] new MaskedArray class

2019-06-22 Thread Allan Haldane

On 6/21/19 2:37 PM, Benjamin Root wrote:

Just to note, data that is masked isn't always garbage. There are plenty
of use-cases where one may want to temporarily apply a mask for a set of
computation, or possibly want to apply a series of different masks to
the data. I haven't read through this discussion deeply enough, but is
this new class going to destroy underlying masked data? and will it be
possible to swap out masks?

Cheers!
Ben Root


Indeed my implementation currently feels free to clobber the data at 
masked positions and makes no guarantees not to.


I'd like to try to support reasonable use-cases like yours though. A few 
thoughts:


First, the old np.ma.MaskedArray explicitly does not promise to preserve 
masked values, with a big warning in the docs. I can't recall the 
examples, but I remember coming across cases where clobbering happens. 
So arguably your behavior was never supported, and perhaps this means 
that no-clobber behavior is difficult to reasonably support.


Second, the old np.ma.MaskedArray avoids frequent clobbering by making 
lots of copies. Therefore, in most cases you will not lose any 
performance in my new MaskedArray relative to the old one by making an 
explicit copy yourself. I.e., is it problematic to have to do


>>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

>>> marr.mask = trial_mask
>>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often 
"effectively" clobbered, in the sense that they are not computed. For 
example, if you do "c = a+b", and then change the mask of c, the values 
at masked position of the result of (a+b) do not correspond to the sum 
of the masked values in a and b. Thus, by "unmasking" c you are exposing 
nonsense values, which to me seems likely to cause heisenbugs.



In summary, by not making no-clobber guarantees and by strictly 
preventing exposure of nonsense values, I suspect that: 1. my new code 
is simpler and faster by avoiding lots of copies, and forces copies to 
be explicit in user code. 2. disallowing direct modification of the mask 
lowers the "API surface area" making people's MaskedArray code less 
buggy and easier to read: Exposure of nonsense values by "unmasking" is 
one less possibility to keep in mind.


Best,
Allan



On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane wrote:

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> Hi Allan,
>
> This is very impressive! I could get the tests that I wrote for my
class
> pass with yours using Quantity with what I would consider very minimal
> changes. I only could not find a good way to unmask data (I like the
> idea of setting the mask on some elements via `ma[item] = X`); is this
> on purpose?

Yes, I want to make it difficult for the user to access the garbage
values under the mask, which are often clobbered values. The only way to
"remove" a masked value is by replacing it with a new non-masked value.


> Anyway, it would seem easily at the point where I should comment
on your
> repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more
complete ducktype to pass into MaskedArray to test, so that's what I'll
work on next, when I have time. I'll either try to finish my
ArrayCollection type, or try making a simple NDunit ducktype
piggybacking on astropy's Unit.

    Best,
Allan


>
> All the best,
>
> Marten
>
>
> On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane wrote:
>
>     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >
>     >
>     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane wrote:
>     > 
>     >
>     >     > This may be too much to ask from the initializer, but, if
>     so, it still
>     >     > seems most useful if it is made as easy as possible to do,
>     say, `class
>     >     > MaskedQuantity(Masked, Quantity): `.
>     >
>     >     Currently MaskedArray does not accept ducktypes as
underlying
>     arrays,
>     >     but I think it shouldn't be too hard to modify it to do so.
>     Good idea!
>     

Re: [Numpy-discussion] new MaskedArray class

2019-06-20 Thread Allan Haldane
On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> Hi Allan,
> 
> This is very impressive! I could get the tests that I wrote for my class
> pass with yours using Quantity with what I would consider very minimal
> changes. I only could not find a good way to unmask data (I like the
> idea of setting the mask on some elements via `ma[item] = X`); is this
> on purpose?

Yes, I want to make it difficult for the user to access the garbage
values under the mask, which are often clobbered values. The only way to
"remove" a masked value is by replacing it with a new non-masked value.


> Anyway, it would seem easily at the point where I should comment on your
> repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more
complete ducktype to pass into MaskedArray to test, so that's what I'll
work on next, when I have time. I'll either try to finish my
ArrayCollection type, or try making a simple NDunit ducktype
piggybacking on astropy's Unit.

Best,
Allan


> 
> All the best,
> 
> Marten
> 
> 
> On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane wrote:
> 
> On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
> >
> >
> > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane wrote:
> > 
> >
> >     > This may be too much to ask from the initializer, but, if
> so, it still
> >     > seems most useful if it is made as easy as possible to do,
> say, `class
> >     > MaskedQuantity(Masked, Quantity): `.
> >
> >     Currently MaskedArray does not accept ducktypes as underlying
> arrays,
> >     but I think it shouldn't be too hard to modify it to do so.
> Good idea!
> >
> >
> > Looking back at my trial, I see that I also never got to duck arrays -
> > only ndarray subclasses - though I tried to make the code as
> agnostic as
> > possible.
> >
> > (Trial at
> >
> 
> https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
> >
> >     I already partly navigated this mixin-issue in the
> >     "MaskedArrayCollection" class, which essentially does
> >     ArrayCollection(MaskedArray(array)), and only takes about 30
> lines of
> >     boilerplate. That's the backwards encapsulation order from
> what you want
> >     though.
> >
> >
> > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
> > mask=[True, False, False])` does indeed not have a `.unit` attribute
> > (and cannot represent itself...); I'm not at all sure that my
> method of
> > just creating a mixed class is anything but a recipe for disaster,
> though!
> 
> Based on your suggestion I worked on this a little today, and now my
> MaskedArray more easily encapsulates both ducktypes and ndarray
> subclasses (pushed to repo). Here's an example I got working with masked
> units using unyt:
> 
> [1]: from MaskedArray import X, MaskedArray, MaskedScalar
> 
> [2]: from unyt import m, km
> 
> [3]: import numpy as np
> 
> [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
> 
> [5]: uarr
> 
> MaskedArray([1., X , 3.])
> [6]: uarr + 1*m
> 
> MaskedArray([1.001, X    , 3.001])
> [7]: uarr.filled()
> 
> unyt_array([1., 0., 3.], 'km')
> [8]: np.concatenate([uarr, 2*uarr]).filled()
> unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
> 
> The catch is the ducktype/subclass has to rigorously follow numpy's
> indexing rules, including distinguishing 0d arrays from scalars. For now
> only I used unyt in the example above since it happens to be less strict
>  about dimensionless operations than astropy.units which trips up my
> repr code. (see below for example with astropy.units). Note in the last
> line I lost the dimensions, but that is because unyt does not handle
> np.concatenate. To get that to work we need a true ducktype for units.
> 
> The example above doesn't expose the ".units" attribute outside the
> MaskedArray, and it doesn't print the units in the repr. But you can
> access them using "filled".
> 
> While I could make MaskedArray forward unknown attribute accesses to the
> encapsulated array, that seems a bit dangerous/bug-prone at first
> glance, so probably I want to require th

Re: [Numpy-discussion] new MaskedArray class

2019-06-19 Thread Allan Haldane
On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
> 
> 
> On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane wrote:
> 
> 
> > This may be too much to ask from the initializer, but, if so, it still
> > seems most useful if it is made as easy as possible to do, say, `class
> > MaskedQuantity(Masked, Quantity): `.
> 
> Currently MaskedArray does not accept ducktypes as underlying arrays,
> but I think it shouldn't be too hard to modify it to do so. Good idea!
> 
> 
> Looking back at my trial, I see that I also never got to duck arrays -
> only ndarray subclasses - though I tried to make the code as agnostic as
> possible.
> 
> (Trial at
> https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
> 
> I already partly navigated this mixin-issue in the
> "MaskedArrayCollection" class, which essentially does
> ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
> boilerplate. That's the backwards encapsulation order from what you want
> though.
> 
> 
> Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
> mask=[True, False, False])` does indeed not have a `.unit` attribute
> (and cannot represent itself...); I'm not at all sure that my method of
> just creating a mixed class is anything but a recipe for disaster, though!

Based on your suggestion I worked on this a little today, and now my
MaskedArray more easily encapsulates both ducktypes and ndarray
subclasses (pushed to repo). Here's an example I got working with masked
units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.])
[6]: uarr + 1*m

MaskedArray([1.001, X, 3.001])
[7]: uarr.filled()

unyt_array([1., 0., 3.], 'km')
[8]: np.concatenate([uarr, 2*uarr]).filled()
unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's
indexing rules, including distinguishing 0d arrays from scalars. For now
I only used unyt in the example above since it happens to be less strict
about dimensionless operations than astropy.units, which trips up my
repr code (see below for an example with astropy.units). Note in the last
line I lost the dimensions, but that is because unyt does not handle
np.concatenate. To get that to work we need a true ducktype for units.

The example above doesn't expose the ".units" attribute outside the
MaskedArray, and it doesn't print the units in the repr. But you can
access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to the
encapsulated array, that seems a bit dangerous/bug-prone at first
glance, so probably I want to require the user to make a MaskedArray
subclass to do so. I've just started playing with that (probably buggy),
and Ive attached subclass examples for astropy.unit and unyt, with some
example output below.

Cheers,
Allan



Example using the attached astropy unit subclass:

>>> from astropy.units import m, km, s
>>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>>> uarr
MaskedQ([1., X , 1.], units=km)
>>> uarr.units
km
>>> uarr + (1*m)
MaskedQ([1.001, X, 1.001], units=km)
>>> uarr/(1*s)
MaskedQ([1., X , 1.], units=km / s)
>>> (uarr*(1*m))[1:]
MaskedQ([X , 1.], units=km m)
>>> np.add.outer(uarr, uarr)
MaskedQ([[2., X , 2.],
 [X , X , X ],
 [2., X , 2.]], units=km)
>>> print(uarr)
[1. X  1.] km m

Cheers,
Allan


> > Even if this impossible, I think it is conceptually useful to think
> > about what the masking class should do. My sense is that, e.g., it
> > should not attempt to decide when an operation succeeds or not,
> but just
> > "or together" input masks for regular, multiple-input functions,
> and let
> > the underlying arrays skip elements for reductions by using `where`
> > (hey, I did implement that for a reason... ;-). In particular, it
> > suggests one should not have things like domains and all that (I never
> > understood why `MaskedArray` did that). If one wants more, the class
> > should provide a method that updates the mask (a sensible default
> might
> > be `mask |= ~np.isfinite(result)` - here, the class being masked
> should
> > logically support ufuncs and functions, so it can decide what
> "isfinite"
> > means).
> 
> I agree it would be nice to remove domains. It would make life easier,
> and I could remov

Re: [Numpy-discussion] new MaskedArray class

2019-06-18 Thread Allan Haldane
On 6/18/19 10:06 AM, Marten van Kerkwijk wrote:
> Hi Allen,
> 
> Thanks for the message and link! In astropy, we've been struggling with
> masking a lot, and one of the main conclusions I have reached is that
> ideally one has a more abstract `Masked` class that can take any type of
> data (including `ndarray`, of course), and behaves like that data as
> much as possible, to the extent that if, e.g., I create a
> `Masked(Quantity(..., unit), mask=...)`, the instance will have a
> `.unit` attribute and perhaps even `isinstance(..., Quantity)` will
> hold. And similarly for `Masked(Time(...), mask=...)`,
> `Masked(SkyCoord(...), mask=...)`, etc. In a way, `Masked` would be a
> kind of Mixin-class that just tracks a mask attribute.
> 
> This may be too much to ask from the initializer, but, if so, it still
> seems most useful if it is made as easy as possible to do, say, `class
> MaskedQuantity(Masked, Quantity): `.

Currently MaskedArray does not accept ducktypes as underlying arrays,
but I think it shouldn't be too hard to modify it to do so. Good idea!

I already partly navigated this mixin-issue in the
"MaskedArrayCollection" class, which essentially does
ArrayCollection(MaskedArray(array)), and only takes about 30 lines of
boilerplate. That's the backwards encapsulation order from what you want
though.

> Even if this impossible, I think it is conceptually useful to think
> about what the masking class should do. My sense is that, e.g., it
> should not attempt to decide when an operation succeeds or not, but just
> "or together" input masks for regular, multiple-input functions, and let
> the underlying arrays skip elements for reductions by using `where`
> (hey, I did implement that for a reason... ;-). In particular, it
> suggests one should not have things like domains and all that (I never
> understood why `MaskedArray` did that). If one wants more, the class
> should provide a method that updates the mask (a sensible default might
> be `mask |= ~np.isfinite(result)` - here, the class being masked should
> logically support ufuncs and functions, so it can decide what "isfinite"
> means).

I agree it would be nice to remove domains. It would make life easier,
and I could remove a lot of twiddly code! I kept it in for now to
minimize the behavior changes from the old MaskedArray.

> In any case, I would think that a basic truth should be that everything
> has a mask with a shape consistent with the data, so
> 1. Each complex numbers has just one mask, and setting `a.imag` with a
> masked array should definitely propagate the mask.
> 2. For a masked array with structured dtype, I'd similarly say that the
> default is for a mask to have the same shape as the array. But that
> something like your collection makes sense for the case where one wants
> to mask items in a structure.

Agreed that we should have a single bool per complex or structured
element, and the mask shape is the same as the array shape. That's how I
implemented it. But there is still a problem with complex.imag assignment:

>>> a = MaskedArray([1j, 2, X])
>>> i = a.imag
>>> i[:] = MaskedArray([1, X, 1])

If we make the last line copy the mask to the original array, what
should the real part of a[2] be? Conversely, if we don't copy the mask,
what should the imag part of a[1] be? It seems like we might "want" the
masks to be OR'd instead, but then should i[2] be masked after we just
set it to 1?


> All the best,
> 
> Marten
> 
> p.s. I started trying to implement the above "Mixin" class; will try to
> clean that up a bit so that at least it uses `where` and push it up.

I played with "where", but didn't include it since 1.17 is not released.
To avoid duplication of effort, I've attached a diff of what I tried. I
actually get a slight slowdown of about 10% by using where...

If you make progress with the mixin, a push is welcome. I imagine a
problem is going to be that np.isscalar doesn't work to detect duck scalars.

Cheers,
Allan




> On Mon, Jun 17, 2019 at 6:43 PM Allan Haldane wrote:
> 
> Hi all,
> 
> Chuck suggested we think about a MaskedArray replacement for 1.18.
> 
> A few months ago I did some work on a MaskedArray replacement using
> `__array_function__`, which I got mostly working. It seems like a good
> time to bring it up for discussion now. See it at:
> 
> https://github.com/ahaldane/ndarray_ducktypes
> 
> It should be very usable, it has docs you can read, and it passes a
> pytest-suite with 1024 tests adapted from numpy's MaskedArray tests.
> What is missing? It needs even more tests for new functionality, and a
> couple numpy-API functions are missing, in particu

[Numpy-discussion] new MaskedArray class

2019-06-17 Thread Allan Haldane
Hi all,

Chuck suggested we think about a MaskedArray replacement for 1.18.

A few months ago I did some work on a MaskedArray replacement using
`__array_function__`, which I got mostly working. It seems like a good
time to bring it up for discussion now. See it at:

https://github.com/ahaldane/ndarray_ducktypes

It should be very usable, it has docs you can read, and it passes a
pytest-suite with 1024 tests adapted from numpy's MaskedArray tests.
What is missing? It needs even more tests for new functionality, and a
couple numpy-API functions are missing, in particular `np.median`,
`np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could
also find many things to improve too.

Besides fixing many little annoyances from MaskedArray, and simplifying
the logic by always storing the mask in full, it also has new features.
For instance it allows the use of a "X" variable to mark masked
locations during array construction, and I solve the issue of how to
mask individual fields of a structured array differently.

At this point I would be happy to get some feedback on the design and
what seems good or bad. If it seems like a good start, I'd be happy to
move it into a numpy repo of some sort for further collaboration &
discussion, and maybe into 1.18. At the least I hope it can serve as a
design study of what we could do.





Let me also drop here two more interesting detailed issues:

First, the issue of what to do about .real and .imag of complex arrays,
and similarly about field-assignment of structured arrays. The problem
is that we have a single mask bool per element of a complex array, but
what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the
mask of the original array change? Should we make .real and .imag readonly?

Second, a more general issue of how to ducktype scalars when using
`__array_function__` which I think many ducktype implementors will have
to face. For MaskedArray, I created an associated "MaskedScalar" type.
However, MaskedScalar has to behave differently from normal numpy
scalars in a number of ways: It is not part of the numpy scalar
hierarchy, it fails checks `isinstance(var, np.floating)`, and
np.isscalar returns false. Numpy scalar types cannot be subclassed. We
have discussed before the need to have distinction between 0d-arrays and
scalars, so we shouldn't just use a 0d (in fact, this makes printing
very difficult). This leads me to think that in future dtype-overhaul
plans, we should consider creating a subclassable `np.scalar` base type
to wrap all numpy scalar variables, and code like `isinstance(var,
np.floating)` should be replaced by `isinstance(var.dtype.type,
np.floating)` or similar. That is, the numeric dtype of the scalar is no
longer encoded in `type(var)` but in `var.dtype`: The fact that the
variable is a numpy scalar is decoupled from its numeric dtype.

This is useful because there are many "associated" properties of scalars
in common with arrays which have nothing to do with the dtype, which
ducktype implementors want to touch. I imagine this will come up a lot:
In that repo I also have an "ArrayCollection" ducktype which required a
"CollectionScalar" scalar, and similarly I imagine people implementing
units want the units attached to the scalar, independently of the dtype.

Cheers,
Allan




___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving forward with value based casting

2019-06-06 Thread Allan Haldane
On 6/6/19 12:46 PM, Sebastian Berg wrote:
> On Thu, 2019-06-06 at 11:57 -0400, Allan Haldane wrote:
>> I think dtype-based casting makes a lot of sense, the problem is
>> backward compatibility.
>>
>> Numpy casting is weird in a number of ways: The array + array casting
>> is
>> unexpected to many users (eg, uint64 + int64 -> float64), and the
>> casting of array + scalar is different from that, and value based.
>> Personally I wouldn't want to try change it unless we make a
>> backward-incompatible release (numpy 2.0), based on my experience
>> trying
>> to change much more minor things. We already put "casting" on the
>> list
>> of desired backward-incompatible changes on the list here:
>> https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release
>>
>> Relatedly, I've previously dreamed about a different "C-style" way
>> casting might behave:
>> https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76
>>
>> The proposal there is that array + array casting, array + scalar, and
>> array + python casting would all work in the same dtype-based way,
>> which
>> mimics the familiar "C" casting rules.
> 
> If I read it right, you do propose that array + python would cast in a
> "minimal type" way for python.

I'm a little unclear what you mean by "minimal type" way. By "minimal
type", I thought you and others are talking about the rule numpy
currently uses that "the output dtype is the minimal dtype capable of
representing the value of both input dtypes", right? But in that gist I
am instead proposing that output-dtype is determined by C-like rules.

For array+py_scalar I was less certain what to do than for array+array
and array+npy_scalar. But I proposed the three "ranks" of 1. bool, 2.
int, and 3. float/complex. My rule for array+py_scalar is that if the
python scalar's rank is less than the numpy operand dtype's rank, use
the numpy dtype. If the python-scalar's rank is greater, use the
"default" types of bool_, int64, float64 respectively. Eg:

np.bool_(1) + 1        -> int64    (default int wins)
np.int8(1) + 1         -> int8     (numpy wins)
np.uint8(1) + (-1)     -> uint8    (numpy wins)
np.int64(1) + 1        -> int64    (numpy wins)
np.int64(1) + 1.0      -> float64  (default float wins)
np.float32(1.0) + 1.0  -> float32  (numpy wins)

Note it does not depend on the numerical value of the scalar, only its type.
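
A small sketch of that array + python-scalar rule from the gist
(illustrative only, not an existing numpy API):

    import numpy as np

    RANK = {'b': 1, 'i': 2, 'u': 2, 'f': 3, 'c': 3}   # bool < int < float/complex
    DEFAULT = {1: np.dtype(np.bool_), 2: np.dtype(np.int64), 3: np.dtype(np.float64)}

    def result_dtype(np_dtype, py_scalar):
        np_dtype = np.dtype(np_dtype)
        py_rank = (1 if isinstance(py_scalar, bool)
                   else 2 if isinstance(py_scalar, int) else 3)
        return np_dtype if py_rank <= RANK[np_dtype.kind] else DEFAULT[py_rank]

    assert result_dtype(np.bool_, 1) == np.int64        # default int wins
    assert result_dtype(np.int8, 1) == np.int8          # numpy wins
    assert result_dtype(np.uint8, -1) == np.uint8       # numpy wins
    assert result_dtype(np.int64, 1.0) == np.float64    # default float wins
    assert result_dtype(np.float32, 1.0) == np.float32  # numpy wins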

> In your write up, you describe that if you mix array + scalar, the
> scalar uses a minimal dtype compared to the array's dtype. 

Sorry if I'm nitpicking/misunderstanding, but in my rules np.uint64(1) +
1 -> uint64 but in numpy's "minimal dtype" rules it is  -> float64. So I
don't think I am using the minimal rule.

> What we
> instead have is that in principle you could have loops such as:
> 
> "ifi->f"
> "idi->d"
> 
> and I think we should chose the first for a scalar, because it "fits"
> into f just fine. (if the input is) `ufunc(int_arr, 12., int_arr)`.

I feel I'm not understanding you, but the casting rules in my gist
follow those two rules if i, f are the numpy types int32 and float32.

If instead you mean (np.int64, py_float, np.int64) my rules would cast
to float64, since py_float has the highest rank and so is converted to
the default numpy-type for that rank, float64.

I would also add that unlike current numpy, my C-casting rules are
associative (if all operands are numpy types, see note below), so it
does not matter in which order you promote the types: (if)i  and i(fi)
give the same result. In current numpy this is not always the case:

p = np.promote_types
p(p('u2', 'i1'), 'f4')   # ->  f8
p('u2', p('i1', 'f4'))   # ->  f4

(However, my casting rules are not associative if you include python
scalars, e.g. np.float32(1) + 1.0 + np.int64(1). Maybe I should try to
fix that...)

Best,
Allan

> I do not mind keeping the "simple" two (or even more) operand "lets
> assume we have uniform types" logic around. For those it is easy to
> find a "minimum type" even before actual loop lookup.
> For the above example it would work in any case well, but it would get
> complicating, if for example the last integer is an unsigned integer,
> that happens to be small enough to fit also into an integer.
> 
> That might give some wiggle room, possibly also to attach warnings to
> it, or at least make things easier. But I would also like to figure out
> as well if we shouldn't try to move in any case. Sure, attach a major
> version to it, but hopefully not a "big step type".
> 
> One thing that I had not thought about is

Re: [Numpy-discussion] Moving forward with value based casting

2019-06-06 Thread Allan Haldane


I think dtype-based casting makes a lot of sense, the problem is
backward compatibility.

Numpy casting is weird in a number of ways: The array + array casting is
unexpected to many users (eg, uint64 + int64 -> float64), and the
casting of array + scalar is different from that, and value based.
Personally I wouldn't want to try change it unless we make a
backward-incompatible release (numpy 2.0), based on my experience trying
to change much more minor things. We already put "casting" on the list
of desired backward-incompatible changes here:
https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release
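
To see the weirdness concretely, with the current rules:

>>> import numpy as np
>>> np.promote_types(np.uint64, np.int64)    # array + array rule
dtype('float64')
>>> a = np.zeros(3, dtype=np.uint64)
>>> (a + 1).dtype           # array + scalar is value-based
dtype('uint64')
>>> (a + (-1)).dtype        # same dtypes involved, different value
dtype('float64')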

Relatedly, I've previously dreamed about a different "C-style" way
casting might behave:
https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76

The proposal there is that array + array casting, array + scalar, and
array + python casting would all work in the same dtype-based way, which
mimics the familiar "C" casting rules.

See also:
https://github.com/numpy/numpy/issues/12525

Allan


On 6/5/19 4:41 PM, Sebastian Berg wrote:
> Hi all,
> 
> TL;DR:
> 
> Value based promotion seems complex both for users and ufunc-
> dispatching/promotion logic. Is there any way we can move forward here,
> and if we do, could we just risk some possible (maybe not-existing)
> corner cases to break early to get on the way?
> 
> ---
> 
> Currently when you write code such as:
> 
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
> 
> Numpy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for all unary functions (and most
> others as well), the output will have a `res.dtype` of uint16.
> 
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
> 
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype  # will be float32
> 
> Currently, this value based logic is enforced by checking whether the
> cast is possible: "4" can be cast to int8, uint8. So the first call
> above will at some point check if "uint16 + uint16 -> uint16" is a
> valid operation, find that it is, and thus stop searching. (There is
> the additional logic, that when both/all operands are scalars, it is
> not applied).
> 
> Note that it is defined in terms of casting "1" to uint8 safely
> being possible, even though 1 may be typed as int64. This logic thus
> affects all promotion rules as well (i.e. what should the output dtype
> be).
> 
> 
> There 2 main discussion points/issues about it:
> 
> 1. Should value based casting/promotion logic exist at all?
> 
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it. It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are too close inside numpy (you will
> often not know which one you get), there is not much option but to
> handle them identically. However, it seems pretty odd that:
>  * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=int8)
>  * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=int8)
> 
> give a different result.
> 
> This is a bit different for python scalars, which do not have a type
> attached already.
> 
> 
> 2. Promotion and type resolution in Ufuncs:
> 
> What is currently bothering me is that the decision what the output
> dtypes should be currently depends on the values in complicated ways.
> It would be nice if we can decide which type signature to use without
> actually looking at values (or at least only very early on).
> 
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for what input. Having value based
> casting in there bloats up the problem.
> Of course it currently works OK, but especially when user dtypes come
> into play, caching would seem like a nice optimization option.
> 
> Because `uint8(127)` can also be an `int8`, but `uint8(128)` cannot, it is
> not as simple as finding the "minimal" dtype once and working with that.
> Of course Eric and I discussed this a bit before, and you could create
> an internal "uint7" dtype which has the only purpose of flagging that a
> cast to int8 is safe.
> 
> I suppose it is possible I am barking up the wrong tree here, and this
> caching/predictability is not vital (or can be solved with such an
> internal dtype easily, although I am not sure it seems elegant).
> 
> 
> Possible options to move forward
> 
> 
> I have to still see a bit how trick things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
>   * The uint7 idea would be one solution
>   * Simply implement something that works for numpy and all except
> strange external ufuncs (I can only think of numba as a plausible
> candidate for creating such).
> 
> My current plan is to see where the second thing leaves me.
> 

Re: [Numpy-discussion] Behaviour of copy for structured dtypes with gaps

2019-04-12 Thread Allan Haldane
I would be much more in favor of `copy` eliminating padding in the
dtype, if dtypes with different paddings were considered equivalent.
But they are not.

Numpy has always treated dtypes with different padding bytes as
not-equal, and prints them very differently:

>>> a = np.array([1], dtype={'names': ['f'],
...                          'formats': ['i4'],
...                          'offsets': [0]})
>>> b = np.array([1], dtype={'names': ['f'],
...                          'formats': ['i4'],
...                          'offsets': [4]})
>>> a.dtype == b.dtype
False
>>> a.dtype
dtype([('f', '<i4')])
>>> b.dtype
dtype({'names':['f'], 'formats':['<i4'], 'offsets':[4], 'itemsize':8})

Also, users often create such dtypes specifically to match the layout of
binary data on disk, e.g.:

>>> np.fromfile('myfile', a.dtype, count=10)

and then it matters very greatly to them whether the dtype has padding
or not.
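
To make concrete what is at stake for `copy` (behavior as of 1.16,
example mine):

>>> import numpy as np
>>> a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'f8'), ('z', 'i4')])
>>> v = a[['a', 'z']]         # multi-field index: a view with a gap
>>> v.dtype.itemsize          # still 16 bytes, the gap for 'b' is kept
16
>>> v.copy().dtype.itemsize   # the copy currently keeps the padded layout
16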

Best,
Allan


PS. It is unfinished, but I would like to advertise an 'ArrayCollection'
ndarray ducktype I have worked a bit on. This ducktype behaves very much
like structured arrays for indexing and assignment, but avoids all these
padding issues and in other ways is more suitable for "pandas-like"
usage than structured arrays. See the "ArrayCollection" and
"MaskedArrayCollection" classes at
https://github.com/ahaldane/ndarray_ducktypes
See the tests and doc folders for some brief example usage.



On 4/11/19 10:07 PM, Nathaniel Smith wrote:
> My concern would be that to implement (2), I think .copy() has to
> either special-case certain dtypes, or else we have to add some kind
> of "simplify for copy" operation to the dtype protocol. These both add
> architectural complexity, so maybe it's better to avoid it unless we
> have a compelling reason?
> 
> On Thu, Apr 11, 2019 at 6:51 AM Marten van Kerkwijk
>  wrote:
>>
>> Hi All,
>>
>> An issue [1] about the copying of arrays with structured dtype raised a 
>> question about what the expected behaviour is: does copy always preserve the 
>> dtype as is, or should it remove padding?
>>
>> Specifically, consider an array with a structure with many fields, say 'a' 
>> to 'z'. Since numpy 1.16, if one does a[['a', 'z']]`, a view will be 
>> returned. In this case, its dtype will include a large offset. Now, if we 
>> copy this view, should the result have exactly the same dtype, including the 
>> large offset (i.e., the copy takes as much memory as the original full 
>> array), or should the padding be removed? From the discussion so far, it 
>> seems the logic has boiled down to a choice between:
>>
>> (1) Copy is a contract that the dtype will not vary (e.g., we also do not 
>> change endianness);
>>
>> (2) Copy is a contract that any access to the data in the array will return 
>> exactly the same result, without wasting memory and possibly optimized for 
>> access with different strides. E.g., `array[::10].copy()` also compacts the 
>> result.
>>
>> An argument in favour of (2) is that, before numpy 1.16, `a[['a', 
>> 'z']].copy()` did return an array without padding. Of course, this relied on 
>> `a[['a', 'z']]` already returning a copy without padding, but still this is 
>> a regression.
>>
>> More generally, there should at least be a clear way to get the compact 
>> copy. Also, it would make sense for things like `np.save` to remove any 
>> padding (it currently does not).
>>
>> What do people think? All the best,
>>
>> Marten
>>
>> [1] https://github.com/numpy/numpy/issues/13299

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal of new keywords for np.nan_to_num

2019-04-08 Thread Allan Haldane
Since there seem to be no objections, I think we're going ahead with
this enhancement for np.nan_to_num.

Cheers,
Allan

On 4/4/19 1:11 PM, kikocorreoso wrote:
> Hi all,
> 
> I propose to add some keywords to nan_to_num function. The addition do 
> not modify the actual behavior. Information related with this addition 
> can be found in these links:
> https://github.com/numpy/numpy/pull/13219
> https://github.com/numpy/numpy/pull/9355
> 
> The basic idea is to allow the user to use their own defined values when 
> replacing nan, positive infinity and/or negative infinity. The proposed 
> names for the keywords are 'nan', posinf', and 'neginf' respectively. So 
> the usage would be something like this:
> 
> >>> a = np.array((np.nan, 2, 3, np.inf, 4, 5, -np.inf))
> >>> np.nan_to_num(a, nan=-999)
> array([-9.99000000e+002,  2.00000000e+000,  3.00000000e+000,
>         1.79769313e+308,  4.00000000e+000,  5.00000000e+000,
>        -1.79769313e+308])
> >>> np.nan_to_num(a, posinf=np.nan, neginf=np.nan)
> array([ 0.,  2.,  3., nan,  4.,  5., nan])
> 
> Please, could you comment if it would be useful the addition?, if the PR 
> needs any change?...
> 
> Thanks to Eric, Joseph, Allan and Matti for their comments and revisions 
> on GH.
> 
> Kind regards.
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] the making of the NumPy roadmap

2018-12-07 Thread Allan Haldane
Thanks a lot for this summary and review, so us devs who weren't there 
are aware of off-list discussions. Thanks all for the brainstorming work 
too!


Cheers,
Allan


On 12/7/18 12:23 AM, Ralf Gommers wrote:

Hi all,

A little while ago I wrote a blog post about the history of us 
constructing the roadmap we currently have: 
https://rgommers.github.io/2018/10/the-making-of-the-numpy-roadmap/. 
There was a request to post it to this list, which seems appropriate. So 
below it is in full.


Cheers,
Ralf


NumPy now has a [roadmap](https://www.numpy.org/neps/index.html#roadmap) 
- long overdue and a major step forward for the project.  We're not done 
yet (see [my previous 
post](https://rgommers.github.io/2018/10/2018-numfocus-summit---a-summary/) 
on this topic) and updating the technical roadmap with new ideas and 
priorities should happen regularly.  Despite everything having been done 
in the open, via minutes of in-person meetings and shared roadmap drafts 
on the numpy-discussion mailing list, it turns out that it's not 
completely clear to the community and even some maintainers how we got 
to this point.  So now seems like a good time for a summary.


## Act 1: NumPy sprint at BIDS

During 24-25 May 2018 a number of people - Stéfan van der Walt, Charles 
Harris, Stephan Hoyer, Matti Picus, Jarrod Millman, Nathaniel J. Smith, 
and Matthew Rocklin - came together at 
[BIDS](https://bids.berkeley.edu/) in Berkeley for a NumPy sprint.  They 
had a productive brainstorm, which resulted in a rough draft of a 
roadmap posted to the mailing list.  The mailing list thread, [A roadmap 
for NumPy - longer term 
planning](https://mail.python.org/pipermail/numpy-discussion/2018-June/thread.html#78103), 
produced more suggestions and after incorporating those a [first pull 
request](https://github.com/numpy/numpy/pull/11446) was made.


## Act 2: BoF and sprint at SciPy'18

In order to collect input from as many people in the community as 
possible, a BoF ([birds of a 
feather](https://en.wikipedia.org/wiki/Birds_of_a_feather_(computing)) 
session) was held on 12 July 2018 during the SciPy 2018 conference in 
Austin.  Stephan Hoyer presented an overview of recent NEPs, and Matti 
Picus presented the first roadmap draft and led a discussion with (I 
estimate) 30-40 people in which a number of new ideas were added.  Other 
maintainers present included Charles Harris, Tyler Reddy, Ralf Gommers 
and Stéfan van der Walt; many other participants were developers of 
libraries that depend on NumPy.


At this point we had a fairly complete list of roadmap items, however 
the document itself was still more a brainstorm dump than a roadmap 
document.  So during the NumPy sprint after the conference on 14 July 
2018 Ralf Gommers and Stephan Hoyer sat together to reshape and clean up 
these ideas.  That resulted in [roadmap draft 
v2](https://github.com/numpy/numpy/wiki/NumPy-roadmap-v2) - not yet 
fully polished, but getting closer to something ready for acceptance.


## Act 3: NumPy sprint at BIDS

In-person meetings seem to be critical to moving a complex and important 
document like a project roadmap forward.  So the next draft was produced 
during a second NumPy sprint at Berkeley during 23-27 July 2018.  
Present were Stéfan van der Walt, Matti Picus, Tyler Reddy and Ralf 
Gommers.  We polished the document, worked out the [Scope of 
NumPy](https://www.numpy.org/neps/scope.html) section in more detail, 
and integrated the roadmap in the html docs.  During the sprint a [pull 
request](https://github.com/numpy/numpy/pull/11611) was sent, and 
[posted](https://mail.python.org/pipermail/numpy-discussion/2018-July/thread.html#78458) 
again to the mailing list.  Everyone seemed happy, and after addressing 
the final few review comments the roadmap was merged on 2 August 2018.


*Final thought: NumPy development is accelerating at the moment, and 
in-person meetings and having the people and funding at BIDS to organize 
sprints has been a huge help. See https://github.com/BIDS-numpy/docs for 
an overview of what has been organized in the last year.*


___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion





Re: [Numpy-discussion] Attribute hiding APIs for PyArrayObject

2018-10-31 Thread Allan Haldane
On 10/30/18 5:04 AM, Matti Picus wrote:
> TL;DR - should we revert the attribute-hiding constructs in
> ndarraytypes.h and unify PyArrayObject_fields with PyArrayObject?
> 
> 
> Background
> 
> 
> NumPy 1.8 deprecated direct access to PyArrayObject fields. It made
> PyArrayObject "opaque", and hid the fields behind a PyArrayObject_fields
> structure
> https://github.com/numpy/numpy/blob/v1.15.3/numpy/core/include/numpy/ndarraytypes.h#L659
> with a comment about moving this to a private header. In order to access
> the fields, users are supposed to use PyArray_FIELDNAME functions, like
> PyArray_DATA and PyArray_NDIM. It seems there were thoughts at the time
> that numpy might move away from a C-struct based
> 
> underlying data structure. Other changes were also made to enum names,
> but those are relatively painless to find-and-replace.
> 
> 
> NumPy has a mechanism to manage deprecating APIs, C users define
> NPY_NO_DEPRECATED_API to a desired level, say NPY_1_8_API_VERSION, and
> can then access the API "as if" they were using NumPy 1.8. Users who do
> not define NPY_NO_DEPRECATED_API get a warning when compiling, and
> default to the pre-1.8 API (aliasing of PyArrayObject to
> PyArrayObject_fields and direct access to the C struct fields). This is
> convenient for downstream users, both since the new API does not provide
> much added value, and it is much easier to write a->nd than
> PyArray_NDIM(a). For instance, pandas uses direct assignment to the data
> field for fast json parsing
> https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/ujson/python/JSONtoObj.c#L203
> via chunks. Working around the new API in pandas would require more
> engineering. Also, for example, cython has a mechanism to transpile
> python code into C, mapping slow python attribute lookup to fast C
> struct field access
> https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#external-extension-types
> 
> 
> 
> In a parallel but not really related universe, cython recently upgraded
> the object mapping so that we can quiet the annoying "size changed"
> runtime warning https://github.com/numpy/numpy/issues/11788 without
> requiring warning filters, but that requires updating the numpy.pxd file
> provided with cython, and it was proposed that NumPy actually vendor its
> own file rather than depending on the cython one
> (https://github.com/numpy/numpy/issues/11803).
> 
> 
> The problem
> 
> 
> We have now made further changes to our API. In NumPy 1.14 we changed
> UPDATEIFCOPY to WRITEBACKIFCOPY, and in 1.16 we would like to deprecate
> PyArray_SetNumericOps and PyArray_GetNumericOps. The strange warning when
> NPY_NO_DEPRECATED_API is not defined is annoying. The new API cannot be supported
> by cython without some deep surgery
> (https://github.com/cython/cython/pull/2640). When I tried dogfooding an
> updated numpy.pxd for the only cython code in NumPy, mtrand.pxy, I came
> across some of these issues (https://github.com/numpy/numpy/pull/12284).
> Forcing the new API will require downstream users to refactor code or
> re-engineer constructs, as in the pandas example above.

I haven't understood the cython issue, but just want to mention that for
optimization purposes it's nice to be able to modify the fields, like in
the pandas/json example above.

In particular, PyArray_ConcatenateArrays uses some tricks which
temporarily clobber the data pointer and shape of an array to
concatenate arrays efficiently. It seems fairly safe to me. These tricks
would be nice to re-use in a C port of the new block code we merged
recently.

Those optimizations aren't possible if only using PyArray_Object.

Cheers,
Allan



> The question
> 
> 
> Is the attribute-hiding effort worth it? Should we give up, revert the
> PyArrayObject/PyArrayObject_fields division and allow direct access from
> C to the numpy internals? Is there another path forward that is less
> painful?
> 
> 
> Matti
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Reminder: weekly status meeting 31.10 at 12:00 pacific time

2018-10-30 Thread Allan Haldane


I'll try to make it, but can't guarantee.



The last time there was discussion of the structured-array PRs which are
currently held up, and I sort of promised to have a writeup of the issues.

I put up a draft of that here:

https://gist.github.com/ahaldane/6cd44886efb449f9c8d5ea012747323b

Allan


On 10/30/18 3:15 PM, Matti Picus wrote:
> The draft agenda is at https://hackmd.io/D3I3CdO2T9ipZ2g5uAChcA?both.
> 
> Everyone is invited to join.
> 
> 
> Matti, Tyler and Stefan
> 
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] BIDS/NumPy dev meetings, Wednesdays 12pm Pacific

2018-10-16 Thread Allan Haldane
I'll try to make it, especially as it looks like you want to discuss two
of my PRs! :)

I have a different meeting a bit before then which might run over
though, so sorry ahead of time if I'm not there.

Cheers,
Allan


On 10/16/18 5:26 PM, Stefan van der Walt wrote:
> Hi everyone,
> 
> This is a friendly reminder of the BIDS/NumPy dev meetings, kicking off
> tomorrow at 12pm Pacific time.
> 
> Please add any topics you wish to discuss to the agenda linked below.
> 
> Best regards,
> Stéfan
> 
> 
> On Thu, 11 Oct 2018 22:43:58 -0700, Stefan van der Walt wrote:
>> The team at BIDS meets once a week to discuss progress, priorities, and
>> roadblocks.  While our priorities are broadly determined by the project
>> roadmap [0], we would like to provide an opportunity for the community
>> to give more regular and detailed feedback on our work.
>>
>> We therefore invite you to join us for our weekly calls,
>> each **Wednesday from 12:00 to 13:00 Pacific Time**.
>>
>> Detail of the next meeting (2018-10-17) is given in the agenda [1],
>> which is a growing document—feel free to add topics you wish to discuss.
>>
>> We hope to see you there!  I will send another reminder next week.
>>
>>
>> [0] https://www.numpy.org/neps/index.html
>> [1] https://hackmd.io/YZfpGn5BSu6acAFLBaRjtw#

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ndrange, like range but multidimensional

2018-10-10 Thread Allan Haldane
On 10/10/18 12:34 AM, Eric Wieser wrote:
> One thing that worries me here - in python, |range(...)| in essence
> generates a lazy |list| - so I’d expect |ndrange| to generate a lazy
> |ndarray|. In practice, that means it would be a duck-type defining an
> |__array__| method to evaluate it, and only implement methods already
> present in numpy.

Isn't that what arange is for?

It seems like there are two uses of python3's range: 1. creating a 1d
iterable of indices for use in for-loops, and 2. with list(range) can be
used to create a sequence of integers.

Numpy can extend this in two directions:
 * ndrange returns an iterable of nd indices (for for-loops).
 * arange returns an 1d ndarray of integers instead of a list

The application of for-loops, which is more niche, doesn't need
ndarray's vectorized properties, so I'm not convinced it should return
an ndarray. It certainly seems simpler not to return an ndarray, due to
the dtype question.

arange on its own seems to cover the need for a vectorized version of range.
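
To spell out the two directions:

>>> import numpy as np
>>> list(np.ndindex(2, 3))     # nd index tuples, for for-loops
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
>>> np.arange(6)               # a vectorized 1d "range"
array([0, 1, 2, 3, 4, 5])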

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ndrange, like range but multidimensional

2018-10-07 Thread Allan Haldane

On 10/07/2018 10:32 AM, Mark Harfouche wrote:

With `np.ndrange` it can look something like this:

```
c = np.empty((4, 4), dtype=object)
# ... compute on c
for i in np.ndrange(c.shape)[:, 1:-1]:
     c[i] # = some operation on c[i] that depends on the index i
```

very pythonic, very familiar to numpy users


So if I understand, this does the same as `np.ndindex` but allows 
numpy-like slicing of the returned generator object, as requested in #6393.


I don't like the duplication in functionality between ndindex and 
ndrange here. Better rather to add the slicing functionality to ndindex, 
 than create a whole new nearly-identical function. np.ndindex is 
already a somewhat obscure and discouraged method since it is usually 
better to find a vectorized numpy operation instead of a for loop, and I 
don't like adding more obscure functions.


But as an improvement to np.ndindex, I think adding this functionality 
seems good if it can be nicely implemented. Maybe there is a way to use 
the same optimization tricks as in the current implementation of ndindex 
but allow different stop/step? A simple wrapper of ndindex?
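
For instance, something along these lines (only a sketch of the idea, not
the PR's implementation; it leans on the fact that python range objects
can themselves be sliced):

import itertools

class sliced_ndindex:
    """ndindex-like iterator that also supports numpy-style slicing."""
    def __init__(self, *ranges):
        self.ranges = [r if isinstance(r, range) else range(r)
                       for r in ranges]

    def __getitem__(self, index):
        if not isinstance(index, tuple):
            index = (index,)
        new = [r[s] for r, s in zip(self.ranges, index)]
        new += self.ranges[len(index):]
        return sliced_ndindex(*new)

    def __iter__(self):
        return itertools.product(*self.ranges)

# e.g., loop over the interior indices of a (4, 4) array:
# for i, j in sliced_ndindex(4, 4)[1:-1, 1:-1]: ...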


Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PR Cleanup

2018-09-25 Thread Allan Haldane
On 09/25/2018 11:51 AM, Charles R Harris wrote:
> Hi All,
> 
> As usual, the top of the PR stack is getting all the attention. As a
> start on cleaning up the old PRs, I'd like to suggest that all the
> maintainers look at their (long) pending PRs and decide which they want
> to keep, close those they don't want to pursue, and rebase the others.
> Might also help if they would post here the PRs that they think we
> should finish up.

PR 6377 [0] fixes up alignment bugs which cause complex64/128 to be
incorrectly aligned, which in turn causes further bugs. Eg,
"np.dtype('c8').alignment" is 8 but should be 4 on most systems.

The PR is in a finished state, modulo reviews. If anyone has access to
sparc/aarch systems it would be great to test it, though.

Cheers,
Allan


[0] https://github.com/numpy/numpy/pull/6377
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] CI Testing

2018-09-12 Thread Allan Haldane

On 09/12/2018 08:14 PM, Tyler Reddy wrote:
Would also be cool to have native ppc64 arch in CI, but I checked 
briefly with IBM and no such integration exists with github as far as 
the contact knew.


Hi Tyler,

ppc64 CI is available through OSUOSL. Earlier this year I requested a 
ppc64be test server through them, for numpy use, see this email and the 
links therein:


https://mail.python.org/pipermail/numpy-discussion/2018-January/077570.html

At the time I didn't sign up for CI because my focus was the test 
server, but I think it is still very much an option to ask for CI if we 
want. I don't yet have the time to do it myself, but I think anyone 
interested could do it pretty easily.


Incidentally, the test server is still available to any numpy devs who 
need it, just ping me for access.


Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Adoption of a Code of Conduct

2018-08-02 Thread Allan Haldane
On 08/02/2018 09:25 AM, Ralf Gommers wrote:
> 
> 
> On Thu, Aug 2, 2018 at 3:35 AM, Marten van Kerkwijk
> <m.h.vankerkw...@gmail.com> wrote:
> 
> I think a legalistic focus on the letter rather than the spirit of
> the code of conduct is not that helpful (and probably what makes if
> feel US centric - funny how court systems end up shaping countries).
> 
> My preference would be to keep exactly the scipy version, so that at
> least for these two highly related project there is just one code of
> conduct. But for what it is worth, it does seem to me political
> views belongs on the list - certainly as likely to give problems
> historically as for any of the others on the list.
> 
> 
> +1 to all of that. If as a result of this discussion we'd want to amend
> the CoC, let's do that to the SciPy version first (or as well) to keep
> things in sync.


+1, I've just read through all this discussion, and I favor this
approach too.

I overall agree with Scipy CoC and would be fine accepting it as is.

I might prefer it to be shorter and less legalistic as well. One of the
goals of the CoC is to attract new and diverse contributors, but we
don't want to push anyone away with a scary legal document either. But
CoCs are a fairly new and somewhat untested phenomenon in open source,
so given that the scipy CoC seems like a good and reasonable effort.

By the way, I thought these articles on CoCs were interesting [1][2],
including the interviews with open-source CoC creators on their
experience in [1] on pros and cons of "rule-based" CoCs.

Allan





[1] Tourani, Adams and Serebrenik, "Code of conduct in open source
projects," 2017 IEEE 24th International Conference on SANER, Klagenfurt,
2017, pp. 24-33.
https://www.win.tue.nl/~aserebre/SANER2017.pdf

[2]
https://modelviewculture.com/pieces/the-new-normal-codes-of-conduct-in-2015-and-beyond





> Cheers,
> Ralf
> 
> 
> If we do end up with a different version, I'd prefer a really short
> one, like just stating the golden rule (treat others as one would
> like to be treated oneself).
> 
> -- Marten

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-31 Thread Allan Haldane
On 05/31/2018 04:14 PM, Marten van Kerkwijk wrote:
> I am -1 on multiple signatures. We may revisit this in time, but for
> now I find the minimal intrusiveness of the current changes
> appealing, especially as it requires few to no changes whatsoever to
> the inner loop function. Multiple dispatch could easily break that
> model by allowing very different signatures to be aggregated into a
> single ufunc, leading to unhandled edge cases and strange segfaults.
> It also seems to me that looping over all signatures might slow down
> ufunc calling, leading to endless variations of strategies of
> optimizing signature ordering.
> 
> 
> I had actually started trying Allan's suggestion [1], and at least
> parsing is not difficult. But I will stop now, as I think your point
> about the inner loop really needing a fixed set of sizes and strides is
> deadly for the whole idea. (Just goes to show I should think before
> writing code!)
> 
> As is, all the proposed changes do is fiddle with size 1 axes (well, and
> defining a fixed size rather than letting the operand do it), which of
> course doesn't matter for the inner loop.

Yes, after seeing how complicated some of the signatures become, I think
I'm more convinced the custom syntax is better.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-31 Thread Allan Haldane

On 05/31/2018 12:10 AM, Nathaniel Smith wrote:

On Wed, May 30, 2018 at 11:14 AM, Marten van Kerkwijk
- When it comes to the core ufunc machinery, we have a limited
complexity budget. I'm nervous that if we add too many bells and
whistles, we'll end up writing ourselves into a corner where we have
trouble maintaining it, where it becomes difficult to predict how
different features interact, it becomes increasingly difficult for
third-parties to handle all the different features in their
__array_ufunc__ methods...


Re: implenetation complexity, I just want to bring up multiple-dispatch
signatures again, where the new signature syntax would just be to join
some signatures together with "|", and try them in order until one works.

I'm not convinced it's better myself, I just wanted to make sure we are 
aware of it. The translation from the current proposed syntax would be:


  Current Syntax                Multiple-dispatch syntax

(n|1),(n|1)->()  <===>   (n),(n)->() | (n),()->() | (),(n)->()


(m?,n),(n,p?)->(m?,p?)  <===> (m,n),(n,p)->(m,p) |
  (n),(n,p)->(p) |
  (m,n),(n)->(m) |
  (n),(n)->()

Conceivably, multiple-dispatch could reduce code complexity because we
don't need all the special flags like UFUNC_CORE_DIM_CAN_BROADCAST, and
instead of handling special syntax for ? and | and any future syntax
separately, we just need a split("|") and then loop with the old 
signature handling code.


On the other hand the m-d signatures are much less concise and the
intention is perhaps harder to read. Yet they more explicitly state
which combinations are allowed, while with '?' syntax you might have to
puzzle it out.

Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-31 Thread Allan Haldane

On 05/31/2018 09:53 AM, Sebastian Berg wrote:



>

Also, I do not imagine these as free-floating ufuncs, I think we can
arrange them in a logical way in a gufunc ecosystem. There would be
some
"core ufuncs", with "associated gufuncs" accessible as attributes.
For
instance, any_less_than will be accessible as less.any



So then, why is it a gufunc and not an attribute using a ufunc with
binary output? I have asked this before, and even got arguments as to
why it fits gufuncs better, but frankly I still do not really
understand.

If it is an associated gufunc, why gufunc at all? We need any() and
all() here, so that is not that many methods, right? And when it comes
to buffering you have much more flexibility.

Say I have the operation:

(float_arr > int_arr).all(axis=(1, 2))

With int_arr being shaped (2, 1000, 1000) (i.e. large along the
interesting axes). A normal gufunc IIRC will get the whole inner
dimension as a float buffer. In other words, you gain practically
nothing, because the whole int_arr will be cast to float anyway.

If, however, you actually implement np.greater_than.all(float_arr,
int_arr, axis=(1, 2)) as a separate ufunc method, you would have the
freedom to work in the typical cache friendly buffersize chunk size for
each of the outer dimensions one at a time. A gufunc would require to
say: please do not buffer for me, or implement all possible type
combinations to do this.
(of course there are memory layout subtleties, since you would have to
optimize always for the "fast exit" case, potentially making the worst
case scenario much worse -- unless you do seriously fancy stuff
anyway).

A more general question is actually whether we should rather focus on
solving the same problem more generally.
For example if `numexpr` would implement all/any reductions, it may be
able to pretty simply get the identical tradeoffs with even more
flexibility! (I have to admit, it may get tricky with multiple
reduction dimensions, etc.)

- Sebastian


Hmm, I hadn't known/considered the limitations of gufunc buffer sizes. I 
was just thinking of them as a standardized interface which handles the 
where/out/broadcasting for you.


I'll have to read about it.

One thing I don't like about the ufunc-method strategy is that it easily 
pollutes all the ufuncs' namespaces and their implementations, so many 
ufuncs have to account for a new "all" method even if inappropriate, for 
example.


Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs

2018-05-31 Thread Allan Haldane

On 05/31/2018 12:10 AM, Nathaniel Smith wrote:

On Wed, May 30, 2018 at 11:14 AM, Marten van Kerkwijk
 wrote:

Hi All,

Following on a PR combining the ability to provide fixed and flexible
dimensions [1] (useful for, e.g., 3-vector input with a signature like
`(3),(3)->(3)`, and for `matmul`, resp.; based on earlier PRs by Jaime
[2] and Matt (Picus) [3]), I've now made a PR with a further
enhancement, which allows one can indicate that a core dimension can
be broadcast [4].

A particular use case is `all_equal`, a new function suggested in a
stalled PR by Matt (Harrigan) [5], which compares two arrays
axis-by-axis, but short-circuits if a non-equality is found (unlike
what is the case if one does `(a==b).all(axis)`). One thing that would
be obviously useful for a routine like `all_equal` is to be able to
provide an array as one argument and a constant as another, i.e., if
the core dimensions can be broadcast if needed, just like they are in
`(a==b).all(axis)`. This is currently not possible: with its signature
of `(n),(n)->()`, the two arrays have to have the same trailing size.

My PR provides the ability to indicate in the signature that a core
dimension can be broadcast, by using a suffix of "|1". Thus, the
signature of `all_equal` would become:

```
(n|1),(n|1)->()
```

Comments most welcome (yes, even on the notation - though I think it
is fairly self-explanatory)!


I'm currently -0.5 on both fixed dimensions and this broadcasting
dimension idea. My reasoning is:

- The use cases seem fairly esoteric. For fixed dimensions, I guess
the motivating example is cross-product (are there any others?). But
would it be so bad for a cross-product gufunc to raise an error if it
receives the wrong number of dimensions? For this broadcasting case...
well, obviously we've survived this long without all_equal :-). And
there's something funny about all_equal, since it's really smushing
together two conceptually separate gufuncs for efficiency. Should we
also have all_less_than, sum_square, ...? If this is a big problem,
then wouldn't it be better to solve it in a general way, like dask or
Numba or numexpr do? To be clear, I'm not saying these features are
necessarily *bad* ideas, in isolation -- just that the benefits aren't
very convincing, and there are trade-offs, like:


I have often wished numpy had these short-circuiting gufuncs, for a very 
long time. I specifically remember my fruitless searches for how to do 
it back to 2007.


While "on average" short-circuiting only gives a speedup of 2x, in many 
situations you can arrange your algorithm so short circuiting will 
happen early, eg usually in the first 10 elements of a 10^6 element 
array, giving enormous speedups.


Also, I do not imagine these as free-floating ufuncs, I think we can 
arrange them in a logical way in a gufunc ecosystem. There would be some 
"core ufuncs", with "associated gufuncs" accessible as attributes. For 
instance, any_less_than will be accessible as less.any


binary "comparison" ufuncs would have attributes

less.any
less.all
less.first  # returns first matching index
less.count  # counts matches without intermediate bool array

This adds on to the existing attributes, for instance
ufuncs already have:

add.reduce
add.accumulate
add.reduceat
add.outer
add.at
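
For example, with the existing attributes:

>>> import numpy as np
>>> np.add.reduce([1, 2, 3])        # same as np.sum
6
>>> np.add.accumulate([1, 2, 3])    # same as np.cumsum
array([1, 3, 6])
>>> np.add.outer([1, 2], [10, 20])
array([[11, 21],
       [12, 22]])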

It is unfortunate that all ufuncs currently have these attributes even 
if they are unimplemented/inappropriate (eg, np.sin.reduce). I would 
like to remove the inappropriate ones, so each core ufunc will only 
have the appropriate "associated gufuncs" as attributes.


Incidentally, once we make reduce/accumuate/... into "associated 
gufuncs", I propose completely removing the "method" argument of 
__array_ufunc__, since it is no longer needed and adds a lot

of complexity which implementors of an __array_ufunc__ are forced to
account for.

Cheers,
Allan








- When it comes to the core ufunc machinery, we have a limited
complexity budget. I'm nervous that if we add too many bells and
whistles, we'll end up writing ourselves into a corner where we have
trouble maintaining it, where it becomes difficult to predict how
different features interact, it becomes increasingly difficult for
third-parties to handle all the different features in their
__array_ufunc__ methods...

- And, we have a lot of other demands on the core ufunc machinery,
that might be better places to spend our limited complexity budget.
For example, can we come up with an extension to make np.sort a
gufunc? That seems like a much higher priority than figuring out how
to make all_equal a gufunc. What about refactoring the ufunc machinery
to support user-defined dtypes? That'll need some serious work, and
again, it's probably higher priority than supporting cross-product or
all_equal directly (or at least it seems that way to me).

Maybe there are more compelling use cases that I'm missing, but as it
is, I feel like trying to add too many features to the current ufunc
machinery is pretty risky for future 

Re: [Numpy-discussion] Splitting MaskedArray into a separate package

2018-05-24 Thread Allan Haldane
On 05/24/2018 11:31 AM, Sebastian Berg wrote:

> I also somewhat like the idea of taking it out (once we have a first
> replacement) in the case that we have a plan to do a better/lower level
> replacement at a later point within numpy.
> Removal generally has its merits, but if a (mid term) replacement will
> come in any case, it would be nice to get those started first if
> possible.
> Otherwise downstream might end up having to fix up things twice.
> 
> - Sebastian

Yes, I think the way forward is to start working on a new masked array
while keeping the old one in place.

Once it has progressed a little and we can step back and look at it, we
can consider how to switch over. I imagine we would have both present in
numpy under different names for a while.

Also, I think it would be nice to work on it soon because it is a chance
for us to eat our own dogfood in the __array_ufunc__ interface, which is
not yet set in stone so we can fix any problems we discover with it.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Splitting MaskedArray into a separate package

2018-05-23 Thread Allan Haldane
On 05/23/2018 04:02 PM, Eric Firing wrote:
> Bad or missing values (and situations where one wants to use a mask to
> operate on a subset of an array) are found in many domains of real life;
> do you really want python users in those domains to have to fall back on
> Matlab-style reliance on nans and/or manual mask manipulations, as the
> new maskedarray package is sidelined?

I also think that missing value support is important to include inside
numpy, just as it is included in other numerical packages like R and Julia.

The time is ripe to write a new and better MaskedArray, because
__array_ufunc__ exists now. With some other numpy devs a few months ago
we also played with rewriting MA using __array_ufunc__ and fixing up all
the bugs and inconsistencies we have discovered over time (eg, getting
rid of the Masked constant). Both Eric and I started working on some
code changes, but never submitted PRs. See a little bit of discussion
here (there was some more elsewhere I can't find now):

https://github.com/numpy/numpy/pull/9792#issuecomment-46420

As I say there, numpy's current MA support is pretty poor compared to R
- Wes McKinney partly justified his desire to move pandas away from
numpy because of it. We have a lot to gain by implementing it nicely.

We already have an NEP discussing possible ways forward:
https://docs.scipy.org/doc/numpy-1.14.0/neps/missing-data.html

I was pretty excited by discussion above, and still am. I want to get
back to it after I finish more immediate priorities - finishing
printing/loading/saving fixes and structured array fixes.

But Masked-Array-2 is on my list of desired long-term enhancements for
numpy.

Allan


___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Extending ufunc signature syntax for matmul, frozen dimensions

2018-04-30 Thread Allan Haldane
On 04/29/2018 05:46 AM, Matti Picus wrote:
> In looking to solve issue #9028 "no way to override matmul/@ if
> __array_ufunc__ is set", it seems there is consensus around the idea of
> making matmul a true gufunc, but matmul can behave differently for
> different combinations of array and vector:
> 
> (n,k),(k,m)->(n,m)
> (n,k),(k) -> (n)
> (k),(k,m)->(m)
> 
> Currently there is no way to express that in the ufunc signature. The
> proposed solution to issue #9029 is to extend the meaning of a signature
> so "syntax like (n?,k),(k,m?)->(n?,m?) could mean that n and m are
> optional dimensions; if missing in the input, they're treated as 1, and
> then dropped from the output" Additionally, there is an open pull
> request #5015 "Add frozen dimensions to gufunc signatures" to allow
> signatures like '(3),(3)->(3)'.

How much harder would it be to implement multiple-dispatch for gufunc
signatures, instead of modifying the signature to include `?` ?

There was some discussion of this last year:

http://numpy-discussion.10968.n7.nabble.com/Changes-to-generalized-ufunc-core-dimension-checking-tp42618p42638.html

That sounded like a clean solution to me, although I'm a bit ignorant of
the gufunc internals and the compatibility constraints.

I assume gufuncs already have code to match the signature to the array
dims, so it sounds fairly straightforward (I say without looking at any
code) to do this in a loop over alternate signatures until one works.
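
For reference, the three cases from that signature list in action:

>>> import numpy as np
>>> A = np.ones((2, 3)); B = np.ones((3, 4)); v = np.ones(3)
>>> (A @ B).shape     # (n,k),(k,m)->(n,m)
(2, 4)
>>> (A @ v).shape     # (n,k),(k)->(n)
(2,)
>>> (v @ B).shape     # (k),(k,m)->(m)
(4,)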

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] NumPy 1.14.3 released

2018-04-28 Thread Allan Haldane

Hi All,

I am pleased to announce the release of NumPy 1.14.3. This is a bugfix
release for a few bugs reported following the 1.14.2 release:

 * np.lib.recfunctions.fromrecords accepts a list-of-lists, until 1.15
 * In python2, float types use the new print style when printing to a file
 * style arg in "legacy" print mode now works for 0d arrays

The Python versions supported in this release are 2.7 and 3.4 - 3.6. The
Python 3.6 wheels available from PIP are built with Python 3.6.2 and should
be compatible with all previous versions of Python 3.6. The source releases
were cythonized with Cython 0.28.2.

Contributors


A total of 6 people contributed to this release.  People with a "+" by their
names contributed a patch for the first time.

* Allan Haldane
* Charles Harris
* Jonathan March +
* Malcolm Smith +
* Matti Picus
* Pauli Virtanen

Pull requests merged


A total of 8 pull requests were merged for this release.

* `#10862 <https://github.com/numpy/numpy/pull/10862>`__: BUG: floating 
types should override tp_print (1.14 backport)
* `#10905 <https://github.com/numpy/numpy/pull/10905>`__: BUG: for 1.14 
back-compat, accept list-of-lists in fromrecords
* `#10947 <https://github.com/numpy/numpy/pull/10947>`__: BUG: 'style' 
arg to array2string broken in legacy mode (1.14...
* `#10959 <https://github.com/numpy/numpy/pull/10959>`__: BUG: test, fix 
for missing flags['WRITEBACKIFCOPY'] key
* `#10960 <https://github.com/numpy/numpy/pull/10960>`__: BUG: Add 
missing underscore to prototype in check_embedded_lapack
* `#10961 <https://github.com/numpy/numpy/pull/10961>`__: BUG: Fix 
encoding regression in ma/bench.py (Issue #10868)
* `#10962 <https://github.com/numpy/numpy/pull/10962>`__: BUG: core: fix 
NPY_TITLE_KEY macro on pypy
* `#10974 <https://github.com/numpy/numpy/pull/10974>`__: BUG: test, fix 
PyArray_DiscardWritebackIfCopy...


Cheers,

Allan Haldane
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Short-circuiting equivalent of np.any or np.all?

2018-04-26 Thread Allan Haldane
On 04/26/2018 12:45 PM, Nathan Goldbaum wrote:
> I'm curious if there's a way to get a fast early-terminating search in
> NumPy? Perhaps there's another package I can depend on that does this? I
> guess I could also write a bit of cython code that does this but so far
> this project is pure python and I don't want to deal with the packaging
> headache of getting wheels built and conda-forge packages set up on all
> platforms.
> 
> Thanks for your help!
> 
> -Nathan

A current PR that implements short-circuiting for "all"-like operations is:

https://github.com/numpy/numpy/pull/8528
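
In the meantime, a chunked scan is a simple pure-python stopgap that gets
much of the early-exit benefit (just a sketch, not what that PR does):

import numpy as np

def chunked_any(a, b, op=np.greater, chunk=4096):
    """Early-exiting version of op(a, b).any(), scanning in chunks."""
    a, b = np.broadcast_arrays(a, b)
    a, b = a.ravel(), b.ravel()
    for i in range(0, a.size, chunk):
        if op(a[i:i+chunk], b[i:i+chunk]).any():
            return True
    return False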

Actually, I have a little dream that we will be able to implement this
kind of short-circuiting more generally in numpy soon, following the
idea in that PR of turning functions into gufuncs. We just need to add
some finishing touches on the gufunc implementation first.

We are almost there - the one important feature gufuncs are still
missing is support for "multiple axis" arguments. See
https://github.com/numpy/numpy/issues/8810.

Once that is done I also think there are some other new and useful
short-circuiting gufuncs we could add, like "count" and "first". See
some comments:

https://github.com/numpy/numpy/pull/8528#issuecomment-365358119

I am imagining we will end up with a "gufunc ecosystem", where there
are some core ufuncs like np.add, np.multiply, np.less_than,
and then a bunch of "associated" gufuncs for each of these, like reduce,
first, all, accessible as attributes of the core ufunc.

(It has long been vaguely planned to turn "reduce" into a gufunc too,
according to comments in the code. I'm excited for when that can happen!)

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] nditer as a context manager (reformatted?)

2018-03-26 Thread Allan Haldane
Given the lack of objections, we are probably going forward with this
change to nditer.

Anyone who uses nditers may have to update their code slightly if they
want to avoid deprecation warnings, but otherwise old nditer code should
work for a long time from now.
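
For anyone wondering, the new style looks roughly like this, with any
writeback happening when the with-block exits:

import numpy as np

a = np.arange(6).reshape(2, 3)
with np.nditer(a, op_flags=['readwrite']) as it:
    for x in it:
        x[...] = 2 * x
# writeback of any temporary (buffered/casted) operands now happens here,
# at context exit, instead of whenever the iterator is deallocated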

Allan

On 03/22/2018 01:43 PM, Matti Picus wrote:
> Hello all, PR #9998 (https://github.com/numpy/numpy/pull/9998/) proposes
> an update to the nditer API, both C and python. The issue
> (https://github.com/numpy/numpy/issues/9714) is that sometimes nditer
> uses temp arrays via the "writeback" mechanism, the data is copied back
> to the original arrays "when finished". However "when finished" was
> implemented using nditer deallocation.
> 
> This mechanism is implicit and unclear, and relies on refcount semantics
> which do not work on non-refcount python implementations like PyPY. It
> also leads to lines of code like "iter=None" to trigger the writeback
> resolution.
> 
> On the c-api level the agreed upon solution is to add a new
> `NpyIter_Close` function in C, this is to be called before
> `NpyIter_Dealloc`.
> 
> The reviewers and I would like to ask the wider NumPy community for
> opinions about the proposed python-level solution: turning the python
> nditer object into a context manager. This way "writeback" occurs at
> context manager exit via a call to `NpyIter_Close`, instead of like
> before when it occurred at `nditer` deallocation (which might not happen
> until much later in Pypy, and could be delayed by GC even in Cpython).
> 
> Another solution that was rejected
> (https://github.com/numpy/numpy/pull/10184) was to add an nditer.close()
> python-level function that would not require a context manager It was
> felt that this is more error-prone since it requires users to add the
> line for each iterator created.
> 
> The back-compat issues are that:
> 
> 1. We are adding a new function to the numpy API, `NpyIter_Close`
> (pretty harmless)
> 
> 2. We want people to update their C code using nditer, to call
> `NpyIter_Close` before  they call `NpyIter_Dealloc` and will start
> raising a deprecation warning if misuse is detected
> 
> 3. We want people to update their Python code to use the nditer object
> as a context manager, and will warn if they do not.
> 
> We tried to minimize back-compat issues, in the sense that old code
> (which didn't work in PyPy anyway) will still work, although it will now
> emit deprecation warnings. In the future we also plan to raise an error
> if an nditer is used in Python without a context manager (when it should
> have been). For C code, we plan to leave the deprecation warning in
> place probably forever, as we can only detect the deprecated behavior in
> the deallocator, where exceptions cannot be raised.
> 
> Anybody who uses nditers should take a look and please reply if it seems
> the change will be too painful.
> 
> For more details, please see the updated docs in that PR
> 
> Matti (and reviewers)

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Using np.frombuffer and cffi.buffer on array of C structs (problem with struct member padding)

2018-01-31 Thread Allan Haldane

On 01/31/2018 02:06 AM, Joe wrote:

Does someone know of a function or a convenient way to automatically
derive a dtype object from a C typedef struct string or a cffi.typeof()?


I remember when I wrote those docs over a year ago, I searched for an 
established way to do this but didn't find one. If you find/come up with 
one, let us know and I might add a pointer or description of it in the docs.


Searching just now, I came across this recent gist:

https://gist.github.com/inactivist/4ef7058c2132fa16759d

So it looks like we can do something like:

import numpy as np

# typemap from C type to numpy type may need some work...
# note that C 'float' is 32 bits, so it maps to numpy 'single', not 'float'
typemap = {'char': 'byte', 'short': 'short', 'int': 'intc',
           'long long': 'longlong', 'size_t': 'intp',
           'float': 'single', 'double': 'double'}

def extract_field_info(tp):
    fields = []
    for field, fieldtype in tp.fields:
        if fieldtype.type.kind == 'primitive':
            format = typemap[fieldtype.type.cname]
        else:
            # nested struct: build a dtype for it recursively
            names, formats, offsets = zip(*extract_field_info(fieldtype.type))
            format = np.dtype({'names': names, 'formats': formats,
                               'offsets': offsets})
        fields.append((field, format, fieldtype.offset))
    return fields

 from cffi import FFI
 ffi = FFI()
 ffi.cdef("struct foo { int a; char b; double d;};")
 tp = ffi.typeof('struct foo')

 names, formats, offsets = zip(*extract_field_info(tp))
 np.dtype({'names': names, 'formats': formats, 'offsets': offsets})
 # compare to: np.dtype('i4,i1,f8', align=True)

I just made that up now and it may be buggy and have missing corner 
cases. But if you agree it works, maybe we can mention it in the 
structured array docs instead of the vague sentence there now, that 
"some work may be needed to obtain correspondence with the C struct".


Allan


Am 27.01.2018 10:30 schrieb Joe:

Thanks for your help on this! This solved my issue.


Am 25.01.2018 um 19:01 schrieb Allan Haldane:

There is a new section discussing alignment in the numpy 1.14 structured
array docs, which has some hints about interfacing with C structs.

These new 1.14 docs are not online yet on scipy.org, but in the meantime
  you can view them here:
https://ahaldane.github.io/user/basics.rec.html#automatic-byte-offsets-and-alignment 



(That links specifically to the discussion of alignments and padding).

Allan

On 01/25/2018 11:33 AM, Chris Barker - NOAA Federal wrote:




The numpy dtype constructor takes an “align” keyword that will pad it
for you.


https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.dtype.html 



-CHB



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion





Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-30 Thread Allan Haldane
On 01/30/2018 04:54 PM, josef.p...@gmail.com wrote:
> 
> 
> On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhald...@gmail.com> wrote:
> 
> On 01/30/2018 01:33 PM, josef.p...@gmail.com wrote:
> > AFAICS, one problem is that the padded view didn't come with the
> > matching down stream usage support, the pack function as mentioned, an
> > alternative way to convert to a standard ndarray, copy doesn't get rid
> > of the padding and so on.
> >
> > eg. another mailing list thread I just found with the same problem
> > 
> > http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
> >
> > quoting Ralf:
> > Question: is that really the recommended way to get an (N, 2) size float
> > array from two columns of a larger record array? If so, why isn't there
> > a better way? If you'd want to write to that (N, 2) array you have to
> > append a copy, making it even uglier. Also, then there really should be
> > tests for views in test_records.py.
> >
> >
> > This "better way" never showed up, AFAIK. And it looks like we came back
> > to this problem every few years.
> >
> > Josef
> 
> Since we are at least pushing off this change to a later release
> (1.15?), we have some time to prepare/catch up.
> 
> What can we add to numpy.lib.recfunctions to make the multi-field
> copy->view change smoother? We have discussed at least two functions:
> 
>  * repack_fields - rearrange the memory layout of a structured array to
> add/remove padding between fields
> 
>  * structured_to_unstructured - turns a n-D structured array into an
> (n+1)-D unstructured ndarray, whose dtype is the highest common type of
> all the fields. May want the inverse function too.
> 
> 
> The only sticky point with statsmodels is to have an equivalent of
> a[['b', 'c']].view(('f8', 2)).
> 
> Highest common dtype might be object, the main usecase for this is to
> select some elements of a specific dtype and then use them as
> standard,homogeneous ndarray. In our case and other cases that I have
> seen it is mainly to select a subset of the floating point numbers.
> Another case of this might be to combine two strings into one  a[['b',
> 'c']].view(('S8'))    if b is s5 and c is S3, but I don't think I used
> this in serious code.

I implemented and put up a draft of these functions in
https://github.com/numpy/numpy/pull/10411

I think they satisfy all your cases: code like

>>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
>>> a[['b', 'c']].view(('f8', 2))

becomes:

>>> import numpy.lib.recfunctions as rf
>>> rf.structured_to_unstructured(a[['b', 'c']])
array([[1., 1.],
       [1., 1.],
       [1., 1.]])

The highest common dtype is usually not "Object", since I use
`np.result_type` to determine the output type. So two fields of 'S5' and
'S3' result in an 'S5' array.
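For instance, the string case can be checked directly with np.result_type
(a quick illustration, not from the original exchange):

>>> import numpy as np
>>> np.result_type(np.dtype('S5'), np.dtype('S3'))
dtype('S5')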


> 
> for inverse function: I guess it is still possible to view any standard
> homogenous ndarray with a structured dtype as long as the itemsize matches.

The inverse is implemented too. And it even supports varied field
dtypes, nested fields, and subarrays, as you can see in the docstring
examples.
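For instance, a minimal round trip might look like this (using the
unstructured_to_structured name from the PR; the final API could differ):

>>> import numpy as np
>>> import numpy.lib.recfunctions as rf
>>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
>>> u = rf.structured_to_unstructured(a[['b', 'c']])
>>> rf.unstructured_to_structured(u, dtype=np.dtype([('b', 'f8'), ('c', 'f8')]))
array([(1., 1.), (1., 1.), (1., 1.)], dtype=[('b', '<f8'), ('c', '<f8')])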


> Browsing through old mailing list threads, I saw that adding multiple
> fields or concatenating two arrays with structured dtypes into an array
> with a single combined dtype was missing and I guess still is. (IIRC
> this is the usecase where we go now the pandas detour in statsmodels.)
> 
> We might also consider
> 
>  * apply_along_fields(arr, method) - applies the method along the
> "field" axis, equivalent to something like
> method(struct_to_unstructured(arr), axis=-1)
> 
> 
> If this works on a padded view of an existing array, then this would be
> an improvement over the current version of having to extract and copy
> the relevant fields of an existing structured dtype or loop over
> different numeric dtypes, ints, floats.
> 
> In general there will need to be a way to apply `method` only to
> selected columns, or columns of a matching dtype. (e.g. We don't want
> the sum or mean of a string.)
> (e.g. we use ptp() on numeric fields to check if there is already a
> constant column in the array or dataframe)

Means over selected columns are accounted for using multi-field
indexing. For example:

>>> b = np.array([
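A sketch of that kind of computation, with made-up values rather than the
figures from the original example:

>>> import numpy as np
>>> import numpy.lib.recfunctions as rf
>>> b = np.array([(1.0, 2, 5.0), (4.0, 5, 7.5)],
...              dtype=[('x', 'f4'), ('y', 'i4'), ('z', 'f8')])
>>> rf.structured_to_unstructured(b[['x', 'z']]).mean(axis=-1)
array([3.  , 5.75])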

Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-30 Thread Allan Haldane

On 01/29/2018 11:50 PM, josef.p...@gmail.com wrote:



On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhald...@gmail.com> wrote:


On 01/29/2018 05:59 PM, josef.p...@gmail.com wrote:



On Mon, Jan 29, 2018 at 5:50 PM, <josef.p...@gmail.com> wrote:



     On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhald...@gmail.com> wrote:

         On 01/29/2018 04:02 PM, josef.p...@gmail.com wrote:
         >
         >
         > On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.r...@gmail.com> wrote:
         >
         >     I <3 structured arrays. I love the fact that I
can access data by
         >     row and then by fieldname, or vice versa. There
are times when I
         >     need to pass just a column into a function, and
there are times when
         >     I need to process things row by row. Yes, pandas
is nice if you want
         >     the specialized indexing features, but it becomes
a bear to deal
         >     with if all you want is normal indexing, or even
the ability to
         >     easily loop over the dataset.
         >
         >
         > I don't think there is a doubt that structured
arrays, arrays with
         > structured dtypes, are a useful container. The
question is whether they
         > should be more or the foundation for more.
         >
         > For example, computing a mean, or reduce operation,
over numeric element
         > ("columns"). Before padded views it was possible to
index by selecting
         > the relevant "columns" and view them as standard
array. With padded
         > views that breaks and AFAICS, there is no way in
numpy 1.14.0 to compute
         > a mean of some "columns". (I don't have numpy 1.14 to
try or find a
         > workaround, like maybe looping over all relevant
columns.)
         >
         > Josef

         Just to clarify, structured types have always had
padding bytes,
         that
         isn't new.

         What *is* new (which we are pushing to 1.15, I think)
is that it
         may be
         somewhat more common to end up with padding than
before, and
         only if you
         are specifically using multi-field indexing, which is a
fairly
         specialized case.

         I think recfunctions already account properly for
padding bytes.
         Except
         for the bug in #8100, which we will fix, padding-bytes in
         recarrays are
         more or less invisible to a non-expert who only cares about
         dataframe-like behavior.

         In other words, padding is no obstacle at all to
computing a
         mean over a
         column, and single-field indexes in 1.15 behave
identically as
         before.
         The only thing that will change in 1.15 is multi-field
indexing,
         and it
         has never been possible to compute a mean (or any binary
         operation) on
         multiple fields.


     from the example in the other thread
     a[['b', 'c']].view(('f8', 2)).mean(0)


     (from the statsmodels usecase:
     read csv with genfromtext to get recarray or structured array
     select/index the numeric columns
     view them as standard array
     do whatever we can do with standard numpy  arrays
     )


Oh ok, I misunderstood. I see your point: a mean over fields is more
difficult than before.

Or, to phrase it as a question:

How do we get a standard array with homogeneous dtype from the
corresponding elements of a structured dtype in numpy 1.14.0?

Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-29 Thread Allan Haldane

On 01/29/2018 05:59 PM, josef.p...@gmail.com wrote:



On Mon, Jan 29, 2018 at 5:50 PM, <josef.p...@gmail.com> wrote:




On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhald...@gmail.com> wrote:

On 01/29/2018 04:02 PM, josef.p...@gmail.com wrote:
>
>
> On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.r...@gmail.com> wrote:
>
>     I <3 structured arrays. I love the fact that I can access data by
>     row and then by fieldname, or vice versa. There are times when I
>     need to pass just a column into a function, and there are times 
when
>     I need to process things row by row. Yes, pandas is nice if you 
want
>     the specialized indexing features, but it becomes a bear to deal
>     with if all you want is normal indexing, or even the ability to
>     easily loop over the dataset.
>
>
> I don't think there is a doubt that structured arrays, arrays with
> structured dtypes, are a useful container. The question is whether 
they
> should be more or the foundation for more.
>
> For example, computing a mean, or reduce operation, over numeric 
element
> ("columns"). Before padded views it was possible to index by selecting
> the relevant "columns" and view them as standard array. With padded
> views that breaks and AFAICS, there is no way in numpy 1.14.0 to 
compute
> a mean of some "columns". (I don't have numpy 1.14 to try or find a
> workaround, like maybe looping over all relevant columns.)
>
> Josef

Just to clarify, structured types have always had padding bytes,
that
isn't new.

What *is* new (which we are pushing to 1.15, I think) is that it
may be
somewhat more common to end up with padding than before, and
only if you
are specifically using multi-field indexing, which is a fairly
specialized case.

I think recfunctions already account properly for padding bytes.
Except
for the bug in #8100, which we will fix, padding-bytes in
recarrays are
more or less invisible to a non-expert who only cares about
dataframe-like behavior.

In other words, padding is no obstacle at all to computing a
mean over a
column, and single-field indexes in 1.15 behave identically as
before.
The only thing that will change in 1.15 is multi-field indexing,
and it
has never been possible to compute a mean (or any binary
operation) on
multiple fields.


from the example in the other thread
a[['b', 'c']].view(('f8', 2)).mean(0)


(from the statsmodels usecase:
read csv with genfromtext to get recarray or structured array
select/index the numeric columns
view them as standard array
do whatever we can do with standard numpy  arrays
)


Oh ok, I misunderstood. I see your point: a mean over fields is more 
difficult than before.



Or, to phrase it as a question:

How do we get a standard array with homogeneous dtype from the 
corresponding elements of a structured dtype in numpy 1.14.0?


Josef


The answer may be that "numpy has never had a way to that",
even if in a few special cases you might hack a workaround using views.

That's what your example seems like to me. It uses an explicit view, 
which is an "expert" feature since views depend on the exact memory 
layout and binary representation of the array. Your example only works 
if the two fields have exactly the same dtype as each other and as the 
final dtype, and evidently breaks if there is byte padding for any reason.


Pandas can do row means without these problems:

>>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)

Numpy is missing this functionality, so you or whoever wrote that 
example figured out a fragile workaround using views.


I suggest that if we want to allow either means over fields, or 
conversion of an n-D structured array to an n+1-D regular ndarray, we 
should add a dedicated function to do so in numpy.lib.recfunctions, 
which does not depend on the binary representation of the array.
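A rough sketch of what such a helper could look like (illustrative only --
the name here is made up, and this is not the eventual recfunctions code):

import numpy as np

def fields_to_unstructured(arr):
    """Stack the fields of a structured array along a new last axis,
    copying field by field so the memory layout never matters."""
    names = arr.dtype.names
    out_dtype = np.result_type(*(arr.dtype[name] for name in names))
    out = np.empty(arr.shape + (len(names),), dtype=out_dtype)
    for i, name in enumerate(names):
        out[..., i] = arr[name]
    return out

With something like that, fields_to_unstructured(a[['b', 'c']]).mean(-1)
would work regardless of any padding bytes in a[['b', 'c']].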

Allan



Josef


Allan

>
>     Cheers!
>     Ben Root
>
> On Mon, Jan 29, 2018 at 3:24 PM, <josef.p...@gmail.com> wrote:

Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-29 Thread Allan Haldane
On 01/29/2018 04:02 PM, josef.p...@gmail.com wrote:
> 
> 
> On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.r...@gmail.com> wrote:
> 
> I <3 structured arrays. I love the fact that I can access data by
> row and then by fieldname, or vice versa. There are times when I
> need to pass just a column into a function, and there are times when
> I need to process things row by row. Yes, pandas is nice if you want
> the specialized indexing features, but it becomes a bear to deal
> with if all you want is normal indexing, or even the ability to
> easily loop over the dataset.
> 
> 
> I don't think there is a doubt that structured arrays, arrays with
> structured dtypes, are a useful container. The question is whether they
> should be more or the foundation for more.
> 
> For example, computing a mean, or reduce operation, over numeric element
> ("columns"). Before padded views it was possible to index by selecting
> the relevant "columns" and view them as standard array. With padded
> views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute
> a mean of some "columns". (I don't have numpy 1.14 to try or find a
> workaround, like maybe looping over all relevant columns.)
> 
> Josef

Just to clarify, structured types have always had padding bytes, that
isn't new.

What *is* new (which we are pushing to 1.15, I think) is that it may be
somewhat more common to end up with padding than before, and only if you
are specifically using multi-field indexing, which is a fairly
specialized case.

I think recfunctions already account properly for padding bytes. Except
for the bug in #8100, which we will fix, padding-bytes in recarrays are
more or less invisible to a non-expert who only cares about
dataframe-like behavior.

In other words, padding is no obstacle at all to computing a mean over a
column, and single-field indexes in 1.15 behave identically as before.
The only thing that will change in 1.15 is multi-field indexing, and it
has never been possible to compute a mean (or any binary operation) on
multiple fields.

Allan

> 
> Cheers!
> Ben Root
> 
> On Mon, Jan 29, 2018 at 3:24 PM, <josef.p...@gmail.com> wrote:
> 
> 
> 
> On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt wrote:
> 
> On Mon, 29 Jan 2018 14:10:56 -0500, josef.p...@gmail.com wrote:
> 
> Given that there is pandas, xarray, dask and more, numpy
> could as well drop
> any pretense of supporting dataframe_likes. Or, adjust
> the recfunctions so
> we can still work dataframe_like with structured
> dtypes/recarrays/recfunctions.
> 
> 
> I haven't been following the duckarray discussion carefully,
> but could
> this be an opportunity for a dataframe protocol, so that we
> can have
> libraries ingest structured arrays, record arrays, pandas
> dataframes,
> etc. without too much specialized code?
> 
> 
> AFAIU while not being in the data handling area, pandas defines
> the interface and other libraries provide pandas compatible
> interfaces or implementations.
> 
> statsmodels currently still has recarray support and usage. In
> some interfaces we support pandas, recarrays and plain arrays,
> or anything where asarray works correctly.
> 
> But recarrays became messy to support, one rewrite of some
> functions last year converts recarrays to pandas, does the
> manipulation and then converts back to recarrays.
> Also we need to adjust our recarray usage with new numpy
> versions. But there is no real benefit because I doubt that
> statsmodels still has any recarray/structured dtype users. So,
> we only have to remove our own uses in the datasets and unit tests.
> 
> Josef
> 
>  
> 
> 
> Stéfan
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org 
> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org 
> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org 
>  

Re: [Numpy-discussion] Multiple-field indexing: view vs copy in 1.14+

2018-01-27 Thread Allan Haldane

On 01/25/2018 03:56 PM, josef.p...@gmail.com wrote:



On Thu, Jan 25, 2018 at 1:49 PM, Marten van Kerkwijk wrote:


On Thu, Jan 25, 2018 at 1:16 PM, Stefan van der Walt wrote:
> On Mon, 22 Jan 2018 10:11:08 -0500, Marten van Kerkwijk wrote:
>>
>> I think on the consistency argument is perhaps the most important:
>> views are very powerful and in many ways one *counts* on them
>> happening, especially in working with large arrays.
>
>
> I had the same gut feeling, but the fancy indexing example made me
> pause:
>
> In [9]: x = np.arange(12, dtype=float).reshape((3, 4))
>
> In [10]: p = x[[0, 1]]  # copy of data
>
> Then:
>
> In [11]: x = np.array([(0, 1), (2, 3)], dtype=[('a', int), ('b', int)])
>
> In [12]: p = x[['a', 'b']]  # copy of data, but proposal will change that


What does this do?
p = x[['a', 'b']].copy()


In 1.14.0 this creates an exact copy of what was returned by
`x[['a', 'b']]`, including any padding bytes.

My impression is that the problems with the view are because the padded 
view doesn't behave like a "standard" dtype or array, i.e. the follow-up 
behavior is the problematic part.


I think the padded view is a "standard array" in the sense that you can 
easily create structured arrays with padding bytes, for example by using 
the `align=True` options.


>>> np.zeros(3, dtype=np.dtype('u1,f4', align=True))
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype={'names':['f0','f1'], 'formats':['u1','<f4'], 'offsets':[0,4], 'itemsize':8})

>>> np.zeros(3, dtype='u1,u1,u1,u1,f4')[['f0', 'f4']]
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype={'names':['f0','f4'], 'formats':['u1','<f4'], 'offsets':[0,4], 'itemsize':8})

> We're not doing the same kind of indexing here exactly (in one case we
> grab elements, in the other parts of elements), but the view behavior
> may still break the "mental expectation".

A bit off-topic, but maybe this is another argument to just allow
`x['a', 'b']` -- I never understood why a tuple was not the
appropriate iterable for getting multiple items from a record.

-- Marten
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org 
https://mail.python.org/mailman/listinfo/numpy-discussion





___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-26 Thread Allan Haldane
On 01/25/2018 08:53 PM, Chris Barker - NOAA Federal wrote:
>> On Jan 25, 2018, at 4:06 PM, Allan Haldane <allanhald...@gmail.com> wrote:
> 
>>> 1) This is a known change with good reason?
> 
>> . The
>> change occurred because the old assignment behavior was dangerous, and
>> was not doing what you thought.
> 
> OK, that’s a good reason!
> 
>>> A) improve the error message.
>>
>> Good idea. I'll see if we can do it for 1.14.1.
> 
> What do folks think about a totuple() method — even before this I’ve
> wanted that. But in this case, it seems particularly useful.
> 
> -CHB

Two thoughts:

1. `totuple` makes most sense for 2d arrays. But what should it do for
1d or 3+d arrays? I suppose it could make the last dimension a tuple, so
1d arrays would give a list of tuples of size 1.

2. structured array's .tolist() already returns a list of tuples. If we
have a 2d structured array, would it add one more layer of tuples? That
would raise an exception if read back in by `np.array` with the same dtype.

These points make me think that instead of a `.totuple` method, this
might be more suitable as a new function in np.lib.recfunctions. If the
goal is to help manipulate structured arrays, that submodule is
appropriate since it already has other functions do manipulate fields in
similar ways. What about calling it `pack_last_axis`?

def pack_last_axis(arr, names=None):
    if arr.dtype.names:
        return arr
    names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
    return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

Then you could do:

>>> pack_last_axis(uv).tolist()

to get a list of tuples.
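As a quick check (assuming a small all-ones (4, 2) float array like the uv
in the earlier message, and with the definition above in scope):

>>> import numpy as np
>>> uv = np.ones((4, 2))
>>> pack_last_axis(uv).tolist()
[(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]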

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Setting custom dtypes and 1.14

2018-01-25 Thread Allan Haldane
On 01/25/2018 06:06 PM, Chris Barker wrote:
> Hi all,
> 
> I'm pretty sure this is the same thing as recently discussed on this
> list about 1.14, but to confirm:
> 
> I had failures in my code with an upgrade for 1.14 -- turns out it was a
> single line in a single test fixture, so no big deal, but a regression
> just the same, with no deprecation warning.
> 
> I was essentially doing this:
> 
> In [*48*]: dt
> 
Out[*48*]: dtype([('time', '<f8'), ('uv', '<f8', (2,))])
> 
> In [*49*]: uv
> 
> Out[*49*]: 
> 
> array([[1., 1.],
> 
>        [1., 1.],
> 
>        [1., 1.],
> 
>        [1., 1.]])
> 
> 
> In [*50*]: time
> 
> Out[*50*]: array([1, 1, 1, 1])
> 
> 
> In [*51*]: full = np.array(zip(time, uv), dtype=dt)
> 
> ---
> 
> ValueError                                Traceback (most recent call last)
> 
> in ()
> 
> ----> 1 full = np.array(zip(time, uv), dtype=dt)
> 
> 
> ValueError: setting an array element with a sequence.
> 
> 
> 
> It took some poking, but the solution was to do:
> 
> full = np.array(zip(time, (tuple(w) for w in uv)), dtype=dt)
> 
> 
> That is, convert the values to nested tuples, rather than an array in a
> tuple, or a list in a tuple.
> 
> As I said, my problem is solved, but to confirm:
> 
> 1) This is a known change with good reason?


This change is a little different from what we discussed before. The
change occurred because the old assignment behavior was dangerous, and
was not doing what you thought. If you modify your dtype above changing
both 'f8' fields to 'f4', you will see you get very strange results:
Your array gets filled in with the values
(1, ( 0.,  1.875)).

Here's what happened: Previously, numpy was *not* iterating your data as
a sequence. Instead, if numpy did not find a tuple it would interpret
the data a a raw buffer and copy the value byte-by-byte, ignoring
endianness, casting, stride, etc. You can get even weirder results if
you do `uv = uv.astype('i4')`, for example.
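The particular numbers fall out of the byte layout: the eight bytes of a
float64 1.0, reinterpreted as two float32s (on a little-endian machine),
are exactly 0.0 and 1.875, which is where the values above come from:

>>> import numpy as np
>>> np.array([1.0]).view('f4').tolist()
[0.0, 1.875]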

It happened to work for you because ndarrays expose a buffer interface,
and you were assigning using exactly the same type and endianness.

In 1.14 the fix was to disallow this 'buffer' assignment for structured
arrays, it was causing quite confusing bugs. Unstructured "void" arrays
still do this though.

> 2) My solution was the best (only) one -- the only way to set a nested
> dtype like that is with tuples?

Right, our solution was to only allow assignment from tuples.

We might be able to relax that for structured scalars, but for arrays I
remember one consideration was to avoid confusion with array
broadcasting: If you do

>>> x = np.zeros(2, dtype='i4,i4')
>>> x[:] = np.array([3, 4])
>>> x
array([(3, 3), (4, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])
>>> x[:] = (3, 4)
>>> x
array([(3, 4), (3, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])

> If so, then I think we should:
> 
> A) improve the error message.
> 
> "ValueError: setting an array element with a sequence."
> 
> Is not really clear -- I spent a while trying to figure out how I could
> set a nested dtype like that without a sequence? and I was actually
> using a ndarray, so it wasn't even a generic sequence. And a tuple is a
> sequence, too...
> 
> I had a vague recollection that in some circumstances, numpy treats
> tuples and lists (and arrays) differently (fancy indexing??), so I tried
> the tuple thing and that worked. But I've been around numpy a long time
> -- that could have been very very confusing to many people.
> 
> So could the message be changed to something like:
> 
> "ValueError: setting an array element with a generic sequence. Only the
> tuple type can be used in this context."
> 
> or something like that -- I'm not sure where else this same error
> message might pop up, so that could be totally inappropriate.

Good idea. I'll see if we can do it for 1.14.1.


> 2) maybe add a .totuple()method to ndarray, much like the .tolist()
> method? that would have been handy here.
>> -Chris
> 
> 
> -- 
> 
> Christopher Barker, Ph.D.
> Oceanographer
> 
> Emergency Response Division
> NOAA/NOS/OR            (206) 526-6959    voice
> 7600 Sand Point Way NE   (206) 526-6329    fax
> Seattle, WA  98115       (206) 526-6317   
> main reception
> 
> chris.bar...@noaa.gov 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Using np.frombuffer and cffi.buffer on array of C structs (problem with struct member padding)

2018-01-25 Thread Allan Haldane
There is a new section discussing alignment in the numpy 1.14 structured
array docs, which has some hints about interfacing with C structs.

These new 1.14 docs are not online yet on scipy.org, but in the meantime
 you can view them here:
https://ahaldane.github.io/user/basics.rec.html#automatic-byte-offsets-and-alignment

(That links specifically to the discussion of alignments and padding).
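As a tiny illustration of the kind of thing covered there (the struct is
made up): a C struct holding a uint8_t followed by a double typically has
7 padding bytes after the first member, and align=True reproduces that
offset and total size on the numpy side:

>>> import numpy as np
>>> dt = np.dtype([('flag', 'u1'), ('value', 'f8')], align=True)
>>> dt.itemsize, dt.fields['value'][1]
(16, 8)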

Allan

On 01/25/2018 11:33 AM, Chris Barker - NOAA Federal wrote:
> 
>>
>> The numpy dtype constructor takes an “align” keyword that will pad it
>> for you.
> 
> https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.dtype.html
> 
> -CHB
> 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multiple-field indexing: view vs copy in 1.14+

2018-01-22 Thread Allan Haldane
On 01/22/2018 11:23 AM, Allan Haldane wrote:
> I just want to note that I
> thought of another way to "fix" this for 1.15 which does not involve
> "pack_fields", which is
> 
>     a[['b', 'c']].astype('f8,f8').view(('f8', 2))
> 
> Which is back-compatible will numpy back to 1.7, I think.

Apologies, this is not back-compatible, do not use it.

I forgot that past versions of numpy had a weird quirk that this will
replace all the structured data with 0s.

Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multiple-field indexing: view vs copy in 1.14+

2018-01-22 Thread Allan Haldane

On 01/22/2018 10:53 AM, josef.p...@gmail.com wrote:
On Sun, Jan 21, 2018 at 9:48 PM, Allan Haldane <allanhald...@gmail.com> wrote:

In  many examples where I used structured dtypes a long time ago, 
switched between consistent views as either a standard array of subsets 
or as .structured dtypes.
For this usecase it wouldn't matter whether a[['a', 'c']] returns a view 
or copy, as long as we can get the second view that is consistent with 
the selected part of the memory. This would also be independent of 
whether numpy pads internally and adjusts the strides if possible or not.



np.__version__

'1.11.2'


a = np.ones(5, dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])
a

array([(1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0),
        (1, 1.0, 1.0)],
       dtype=[('a', '<i8'), ('b', '<f8'), ('c', '<f8')])

a[['b', 'c']].view(('f8', 2)).dtype

dtype('float64')


Thanks for a real example to think about. I just want to note that I 
thought of another way to "fix" this for 1.15 which does not involve 
"pack_fields", which is


a[['b', 'c']].astype('f8,f8').view(('f8', 2))

Which is back-compatible will numpy back to 1.7, I think.

So that's another option to ease the transition.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Multiple-field indexing: view vs copy in 1.14+

2018-01-22 Thread Allan Haldane

On 01/22/2018 10:53 AM, josef.p...@gmail.com wrote:


This is similar to the above example
a[['a', 'c']].view('i8')
but it doesn't try to combine fields.

In  many examples where I used structured dtypes a long time ago, 
switched between consistent views as either a standard array of subsets 
or as .structured dtypes.
For this usecase it wouldn't matter whether a[['a', 'c']] returns a view 
or copy, as long as we can get the second view that is consistent with 
the selected part of the memory. This would also be independent of 
whether numpy pads internally and adjusts the strides if possible or not.



np.__version__

'1.11.2'


a = np.ones(5, dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])
a

array([(1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0),
        (1, 1.0, 1.0)],
       dtype=[('a', '<i8'), ('b', '<f8'), ('c', '<f8')])

[Numpy-discussion] Multiple-field indexing: view vs copy in 1.14+

2018-01-21 Thread Allan Haldane

Hello all,

We are making a decision (again) about what to do about the
behavior of multiple-field indexing of structured arrays: Should
it return a view or a copy, and on what release schedule?

As a reminder, this refers to operations like (1.13 behavior):

>>> a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'f4')])
>>> a[['a', 'c']]
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype=[('a', '<i4'), ('c', '<f4')])

[Numpy-discussion] PowerPC machine available from OSUOSL

2018-01-05 Thread Allan Haldane
Hi all,

For building and testing Numpy on the PowerPC arch, I've requested and
obtained a VM instance at OSUOSL, which provides free hosting for
open-source projects. This was suggested by @npanpaliya on github.

http://osuosl.org/services/powerdev/request_hosting/

I have an immediate use for it to fix some float128 ppc problems, but it
can be useful to other numpy devs too. If you are a numpy dev and want
access to a ppc system for testing, I can gladly create an account for
you. For now, just send me an email with your desired username and a
public ssh key. We can discuss permissions and management depending on
interest.

=== VM details ===

It is single node, Ubuntu 16.04.1, ppc64 POWER (Big Endian) arch,
"m1_small" flavor, with usage policy at
http://osuosl.org/services/hosting/policy/

I've installed the packages needed to build and test numpy.
I ran tests: Numpy 1.13.3 has 2 unimportant test failures, but as
expected 1.14 has ~20 more failures related to float128 reprs.

=== Other services to Consider ===

I did not request it but they provide a "Power CI" service which allows
automated testing of ppc arch on github. If we want this I can look into
it more. We might also consider asking for a node with ppc64le (Little
Endian) arch for testing, but maybe the big-endian one is enough.

Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PyArray_GETITEM and PyArray_SETITEM

2017-11-17 Thread Allan Haldane
On 11/13/2017 01:53 PM, Mmanu Chaturvedi wrote:
> Hello All,
> 
> I need to make use of the limited numpy API access Pybind11 gives, in order
> to add a feature to it.  It seems to give access to functions from
> numpy_api.py [1].  I need to use PyArray_GETITEM and PyArray_SETITEM in
> order to get and set array elements [2], these functions / macros  are not
> exposed via numpy_api.py, but are in `numpy/ndarraytypes.h`.
> 
> We were wondering why aren't PyArray_GETITEM and PyArray_SETITEM exposed
> like the rest of numpy API?  Is it possible to replicate the behavior using
> the members exposed in numpy_api.py ?  Any help would be appreciated.
> 
> Mmanu

It looks like that was the plan. There are comments there saying they
would become part of the API in "numpy 2.0" (which hasn't happened yet).

In the meantime, maybe you can use PySequence_SetItem? I expect that
there is only very minimal overhead in using that vs PyArray_SETITEM.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.vstack vs. np.stack

2017-11-09 Thread Allan Haldane
On 11/09/2017 05:39 PM, Robert Kern wrote:
> On Thu, Nov 9, 2017 at 1:58 PM, Mark Bakker  wrote:
> 
>> Can anybody explain why vstack is going the way of the dodo?
>> Why are stack / concatenate better? What is 'bad' about vstack?
> 
> As far as I can tell, the discussion happened all on Github, not the
> mailing list. See here for references:
> 
> https://github.com/numpy/numpy/pull/7253
> 
> --
> Robert Kern

yes, and in particular this linked comment/PR:

https://github.com/numpy/numpy/pull/5605#issuecomment-85180204

Maybe we should reword the vstack docstring so that it doesn't imply
that vstack is going away. It should say something weaker
like "the functions np.stack, np.concatenate, and np.block are often
more general/useful/less confusing alternatives".. or better explain
what the problem is.

If we limit ourselves to 1d,2d and maybe 3d arrays the vstack behavior
doesn't seem all that confusing to me.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.vstack vs. np.stack

2017-11-09 Thread Allan Haldane
On 11/09/2017 04:30 AM, Joe wrote:
> Hello,
> 
> I have a question and hope that you can help me.
> 
> The doc for vstack mentions that "this function continues to be
> supported for backward compatibility, but you should prefer
> np.concatenate or np.stack."
> 
> Using vstack was convenient because "the arrays must have the same shape
> along all but the first axis."
> 
> So it was possible to stack an array (3,) and (2, 3) to a (3, 3) array
> without using e.g. atleast_2d on the (3,) array.
> 
> Is there a possibility to mimic that behavior with np.concatenate or
> np.stack?
> 
> Joe

I might write this as either

np.concatenate([a[np.newaxis, :], b])

(which switches a newaxis for an atleast_2d, and is also more explicit
about where the axis is added), or, as

np.block([[a],[b]])

Both act like vstack.
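A minimal check of the shapes for the case in the question:

>>> import numpy as np
>>> a = np.arange(3.)          # shape (3,)
>>> b = np.ones((2, 3))        # shape (2, 3)
>>> np.concatenate([a[np.newaxis, :], b]).shape
(3, 3)
>>> np.block([[a], [b]]).shape
(3, 3)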

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] deprecate updateifcopy in nditer operand, flags?

2017-11-08 Thread Allan Haldane
On 11/08/2017 03:12 PM, Nathaniel Smith wrote:
> - We could adjust the API so that there's some explicit operation to
> trigger the final writeback. At the Python level this would probably
> mean that we start supporting the use of nditer as a context manager,
> and eventually start raising an error if you're in one of the "unsafe"
> case and not using the context manager form. At the C level we
> probably need some explicit "I'm done with this iterator now" call.
>
> One question is which cases exactly should produce warnings/eventually
> errors. At the Python level, I guess the simplest rule would be that
> if you have any write/readwrite arrays in your iterator, then you have
> to use a 'with' block. At the C level, it's a little trickier, because
> it's hard to tell up-front whether someone has updated their code to
> call a final cleanup function, and it's hard to emit a warning/error
> on something that *doesn't* happen. (You could print a warning when
> the nditer object is GCed if the cleanup function wasn't called, but
> you can't raise an error there.) I guess the only reasonable option is
> to deprecate NPY_ITER_READWRITE and NP_ITER_WRITEONLY, and make people
> switch to passing new flags that have the same semantics but also
> promise that the user has updated their code to call the new cleanup
> function.
Seems reasonable.

When people use the Nditer C-api, they (almost?) always call
NpyIter_Dealloc when they're done. Maybe that's a place to put a warning
for C-api users. I think you can emit a warning there since that
function calls the GC, not the other way around.

It looks like you've already discussed the possibilities of putting
things in NpyIter_Dealloc though, and it could be tricky, but if we only
need a warning maybe there's a way.
https://github.com/numpy/numpy/pull/9269/files/6dc0c65e4b2ea67688d6b617da3a175cd603fc18#r127707149

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.array, copy=False and memmap

2017-08-10 Thread Allan Haldane
On 08/10/2017 02:24 PM, Sebastian Berg wrote:
> On Thu, 2017-08-10 at 12:27 -0400, Allan Haldane wrote:
>> On 08/07/2017 05:01 PM, Nisoli Isaia wrote:
>>> Dear all,
>>> I have a question about the behaviour of
>>>
>>> y = np.array(x, copy=False, dtype='float32')
>>>
>>> when x is a memmap. If we check the memmap attribute of mmap
>>>
>>> print "mmap attribute", y._mmap
>>>
>>> numpy tells us that y is not a memmap.
>>> But the following code snippet crashes the python interpreter
>>>
>>> # opens the memmap
>>> with open(filename,'r+b') as f:
>>>   mm = mmap.mmap(f.fileno(),0)
>>>   x = np.frombuffer(mm, dtype='float32')
>>>
>>> # builds an array from the memmap, with the option copy=False
>>> y = np.array(x, copy=False, dtype='float32')
>>> print "before", y
>>>
>>> # closes the file
>>> mm.close()
>>> print "after", y
>>>
>>> In my code I use memmaps to share read-only objects when doing
>>> parallel
>>> processing
>>> and the behaviour of np.array, even if not consistent, it's
>>> desirable.
>>> I share scipy sparse matrices over many processes and if np.array
>>> would
>>> make a copy
>>> when dealing with memmaps this would force me to rewrite part of
>>> the sparse
>>> matrices
>>> code.
>>> Would it be possible in the future releases of numpy to have
>>> np.array
>>> check,
>>> if copy is false, if y is a memmap and in that case return a full
>>> memmap
>>> object
>>> instead of slicing it?
>>
>> This does appear to be a bug in numpy or mmap.
>>
> 
> Frankly on first sight, I do not think it is a bug in either of them.
> Numpy uses view (memmap really is just a name for a memory map backed
> numpy array). The numpy array will hold a reference to the memory map
> object in its `.base` attribute (or the base of the base, etc.).
> 
> If you close a mmap object, and then keep using it, you can get
> segfaults of course, I am not sure what you can do about it. Maybe
> python can try to warn you when you exit the context/close a file
> pointer, but I suppose: Python does memory management for you, it makes
> doing IO management easy, but you need to manage the IO correctly. That
> this segfaults and not just errors may be annoying, but seems the
> nature of things on first sight.
> 
> - Sebastian

I admit I have not had time to investigate it thoroughly, but it appears
to me that the intended design of mmap was to make it impossible to
close a mmap if there were still pointers to it.

Consider the following behavior (python3):

>>> import mmap
>>> with open('test', 'r+b') as f:
...     mm = mmap.mmap(f.fileno(), 0)
...     mv = memoryview(mm)
...     mm.close()
BufferError: cannot close exported pointers exist

If memoryview behaves this way, why doesn't/can't ndarray? (Both use the
PEP3118 interface, as far as I understand).

You can see in the mmap code that it tries to carefully keep track of
any exported buffers, but numpy manages to bypass this:
https://github.com/python/cpython/blob/b879fe82e7e5c3f7673c9a7fa4aad42bd05445d8/Modules/mmapmodule.c#L727
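One way to keep that protection when handing the buffer to numpy is to put
an explicit memoryview in between, so the export stays registered with the
mmap object while the memoryview is alive (a sketch, using a uint8 view to
avoid any itemsize issues):

>>> import mmap
>>> import numpy as np
>>> f = open('test', 'r+b')
>>> mm = mmap.mmap(f.fileno(), 0)
>>> mv = memoryview(mm)
>>> x = np.frombuffer(mv, dtype='uint8')
>>> mm.close()
BufferError: cannot close exported pointers exist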



Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Scipy 2017 NumPy sprint

2017-07-02 Thread Allan Haldane

On 07/02/2017 10:03 AM, Charles R Harris wrote:

Updated list below.

On Sat, Jul 1, 2017 at 7:08 PM, Benjamin Root wrote:


Just a heads-up. There is now a sphinx-gallery plugin. Matplotlib
and a few other projects have migrated their docs over to use it.

https://sphinx-gallery.readthedocs.io/en/latest/


Cheers!
Ben Root


On Sat, Jul 1, 2017 at 7:12 AM, Ralf Gommers wrote:



On Fri, Jun 30, 2017 at 6:50 AM, Pauli Virtanen wrote:

Charles R Harris kirjoitti 29.06.2017 klo 20:45:
> Here's a random idea: how about building a NumPy gallery?
> scikit-{image,learn} has it, and while those projects may have more
> visual datasets, I can imagine something along the lines of Nicolas
> Rougier's beautiful book:
>
> http://www.labri.fr/perso/nrougier/from-python-to-numpy/
>
> So that would be added in the numpy/numpy.org repo?

Or https://scipy-cookbook.readthedocs.io/ ?
(maybe minus bitrot and images added :)


I'd like the numpy.org one. numpy.org is now incredibly sparse and ugly, a
gallery would make it look a lot better.

Another idea, from the "deprecate np.matrix" discussion: add
numpy documentation describing the preferred way to handle
matrices, extolling the virtues of @, and move np.matrix
documentation to a deprecated section.


  Putting things together with a few new ideas,

 1. add gallery to numpy.org,
 2. add extended documentation of '@' operator,
 3. make Numpy tests Pytest compatible,
 4. add matrix multiplication ufunc.

  Any more ideas?


The new doctest runner suggested in the printing thread? This is to 
ignore whitespace and precision in ndarray output.


I can see an argument for distributing it in numpy if it is designed to 
be specially aware of ndarrays or numpy scalars (eg to test equality 
between 'wants' and 'got')
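A rough sketch of what an ndarray-aware checker could look like (this is
hypothetical, not an existing numpy utility):

import doctest
import numpy as np

class NumpyOutputChecker(doctest.OutputChecker):
    """Fall back to approximate array comparison when exact text matching fails."""
    def check_output(self, want, got, optionflags):
        if doctest.OutputChecker.check_output(self, want, got, optionflags):
            return True
        # Try to parse both sides as array reprs and compare numerically.
        ns = {'array': np.array, 'nan': np.nan, 'inf': np.inf}
        try:
            w = eval(want, ns)
            g = eval(got, ns)
        except Exception:
            return False
        try:
            return bool(np.allclose(w, g, equal_nan=True))
        except TypeError:
            return False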


Allan



Chuck



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposed changes to array printing in 1.14

2017-06-30 Thread Allan Haldane
On 06/30/2017 09:17 AM, Gael Varoquaux wrote:
> Indeed, for scikit-learn, this would be a major problem.
> 
> Gaël

I just ran the scikit-learn tests.

With the new behavior (removed whitespace), I do get 70 total failures:

$ make test-doc
Ran 39 tests in 39.503s
FAILED (SKIP=3, failures=19)

$ make test
Ran 8122 tests in 387.650s
FAILED (SKIP=58, failures=51)

After setting `np.set_printoptions(pad_sign=True)` (see other email) I
get only 1 failure in total, which is due to the presence of a 0d array
in gaussian_process.rst.

So it looks like the pad_sign option as currently implemented is good
enough to avoid almost all doctest errors.

Allan



> On Fri, Jun 30, 2017 at 05:55:52PM +1000, Juan Nunez-Iglesias wrote:
>> To reiterate my point on a previous thread, I don't think this should happen
>> until NumPy 2.0. This *will* break a massive number of doctests, and what's
>> worse, it will do so in a way that makes it difficult to support doctesting 
>> for
>> both 1.13 and 1.14. I don't see a big enough benefit to these changes to
>> justify breaking everyone's tests before an API-breaking version bump.
> 
>> On 30 Jun 2017, 6:42 AM +1000, Marten van Kerkwijk 
>> <m.h.vankerkw...@gmail.com>,
>> wrote:
> 
>> To add to Allan's message: point (2), the printing of 0-d arrays, is
>> the one that is the most important in the sense that it rectifies a
>> really strange situation, where the printing cannot be logically
>> controlled by the same mechanism that controls >=1-d arrays (see PR).
> 
>> While point 3 can also be considered a bug fix, 1 & 4 are at some
>> level matters of taste; my own reason for supporting their
>> implementation now is that the 0-d arrays already forces me (or,
>> specifically, astropy) to rewrite quite a few doctests, and I'd rather
>> have everything in one go -- in this respect, it is a pity that this
>> is separate from the earlier change in printing for structured arrays
>> (which was also much for the better, but broke a lot of doctests).
> 
>> -- Marten
> 
> 
> 
>> On Thu, Jun 29, 2017 at 3:38 PM, Allan Haldane <allanhald...@gmail.com>
>> wrote:
> 
>> Hello all,
> 
>> There are various updates to array printing in preparation for numpy
>> 1.14. See https://github.com/numpy/numpy/pull/9139/
> 
>> Some are quite likely to break other projects' doc-tests which 
>> expect a
>> particular str or repr of arrays, so I'd like to warn the list in 
>> case
>> anyone has opinions.
> 
>> The current proposed changes, from most to least painful by my
>> reckoning, are:
> 
>> 1.
>> For float arrays, an extra space previously used for the sign 
>> position
>> will now be omitted in many cases. Eg, `repr(arange(4.))` will now
>> return 'array([0., 1., 2., 3.])' instead of 'array([ 0., 1., 2., 
>> 3.])'.
> 
>> 2.
>> The printing of 0d arrays is overhauled. This is a bit finicky to
>> describe, please see the release note in the PR. As an example of the
>> effect of this, the `repr(np.array(0.))` now prints as 'array(0.)`
>> instead of 'array(0.0)'. Also the repr of 0d datetime arrays is now
>> like
>> "array('2005-04-04', dtype='datetime64[D]')" instead of
>> "array(datetime.date(2005, 4, 4), dtype='datetime64[D]')".
> 
>> 3.
>> User-defined dtypes which did not properly implement their `repr` 
>> (and
>> `str`) should do so now. Otherwise it now falls back to
>> `object.__repr__`, which will return something ugly like
>> ``. (Previously you could depend on
>> only implementing the `item` method and the repr of that would be
>> printed. But no longer, because this risks infinite recursions.).
> 
>> 4.
>> Bool arrays of size 1 with a 'True' value will now omit a space, so
>> that
>> `repr(array([True]))` is now 'array([True])' instead of
>> 'array([ True])'.
> 
>> Allan
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion