Re: [Numpy-discussion] Behavior of .reduceat()

2016-05-22 Thread Feng Yu
Hi Marten,

As a user of reduceat I seriously like your new proposal!

I notice that in your current proposal, each element in the 'at' list
is interpreted as if it were a set of parameters to `slice`.

I wonder if it is meaningful to define reduceat on other `fancy` indexing types?

Cheers,

- Yu

On Sun, May 22, 2016 at 12:15 PM, Marten van Kerkwijk
 wrote:
> Hi Jaime,
>
> Very belated reply, but only now that the semester is over do I seem to have
> regained some time to think.
>
> The behaviour of reduceat has always seemed a bit odd to me: logical for
> dividing up an array into irregular but contiguous pieces, but illogical for
> more random ones (where one effectively passes in pairs of points, only to
> remove the unwanted calculations after the fact by slicing with [::2];
> indeed, the very first example in the documentation does exactly this [1]).
> I'm not sure any of your proposals helps all that much for the latter case,
> while it risks breaking existing code in unexpected ways.
>
> For me, for irregular pieces, it would be much nicer to simply pass in pairs
> of points. I think this can be quite easily done in the current API, by
> expanding it to recognize multidimensional index arrays (with last dimension
> of 2; maybe 3 for step as well?). These numbers would just be the equivalent
> of start, end (and step?) of `slice`, so I think one can allow any integer
> with negative values having the usual meaning and clipping at 0 and length.
> So, specifically, the first example in the documentation would change from:
>
> np.add.reduceat(np.arange(8), [0, 4, 1, 5, 2, 6, 3, 7])[::2]
>
> to
>
> np.add.reduceat(np.arange(8), [(0, 4), (1, 5), (2, 6), (3, 7)])
>
> (Or an equivalent ndarray. Note how horrid the example is: really, you'd
> want 4,8 as a pair too, but in the current API, you'd get that by adding a
> 4.)
>
> What do you think? Would this also be easy to implement?
>
> All the best,
>
> Marten
>
> [1]
> http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduceat.html
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Behavior of .reduceat()

2016-05-22 Thread Marten van Kerkwijk
Hi Jaime,

Very belated reply, but only now that the semester is over do I seem to have
regained some time to think.

The behaviour of reduceat has always seemed a bit odd to me: logical for
dividing up an array into irregular but contiguous pieces, but illogical
for more random ones (where one effectively passes in pairs of points, only
to remove the unwanted calculations after the fact by slicing with [::2];
indeed, the very first example in the documentation does exactly this [1]).
I'm not sure any of your proposals helps all that much for the latter case,
while it risks breaking existing code in unexpected ways.

For me, for irregular pieces, it would be much nicer to simply pass in
pairs of points. I think this can be quite easily done in the current API,
by expanding it to recognize multidimensional index arrays (with last
dimension of 2; maybe 3 for step as well?). These numbers would just be the
equivalent of start, end (and step?) of `slice`, so I think one can allow
any integer with negative values having the usual meaning and clipping at 0
and length. So, specifically, the first example in the documentation would
change from:

np.add.reduceat(np.arange(8), [0, 4, 1, 5, 2, 6, 3, 7])[::2]

to

np.add.reduceat(np.arange(8), [(0, 4), (1, 5), (2, 6), (3, 7)])

(Or an equivalent ndarray. Note how horrid the example is: really, you'd
want 4,8 as a pair too, but in the current API, you'd get that by adding a
4.)
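
For concreteness, the proposed pairs behaviour can already be emulated in
current NumPy with an explicit loop (a sketch only; `reduce_pairs` is a
hypothetical helper written for illustration, not an existing API, and it
forgoes the C-level speed that a native implementation would have):

```python
import numpy as np

def reduce_pairs(ufunc, arr, pairs):
    """Hypothetical helper: apply ufunc.reduce over each [start, stop) pair."""
    return np.array([ufunc.reduce(arr[start:stop]) for start, stop in pairs])

a = np.arange(8)
result = reduce_pairs(np.add, a, [(0, 4), (1, 5), (2, 6), (3, 7)])
# matches np.add.reduceat(a, [0, 4, 1, 5, 2, 6, 3, 7])[::2]
```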

What do you think? Would this also be easy to implement?

All the best,

Marten

[1]
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduceat.html


Re: [Numpy-discussion] axis parameter for count_nonzero

2016-05-22 Thread G Young
After some discussion with *@rgommers*, I have simplified the code as
follows:

1) the path to the original count_nonzero in the C API is essentially
unchanged, save some small overhead with Python calling and the
if-statement to check the *axis* parameter

2) All of the complicated validation of the *axis* parameter and acrobatics
for getting the count is handled *only* after we cannot fast-track via a
numerical, boolean, or string *dtype*.

The question still remains whether or not leaving the *axis* parameter in
the Python API for now (given how complicated it is to add in the C API) is
acceptable.  I will say that in response to the concern of adding
parameters such as "out" and "keepdims" (should they be requested), we
could avail ourselves of functions like median for help, as *@juliantaylor*
pointed out.  The *scipy* library has dealt with this problem as well in
its *sparse* modules, so that is also a useful resource.
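
As a rough sketch of the dispatch structure described above (illustrative
only; `count_nonzero_axis` is a made-up name, and the real PR's fast paths
and validation are more involved than this fallback):

```python
import numpy as np

def count_nonzero_axis(a, axis=None):
    # Fast track: no axis given, so defer to the existing C implementation.
    if axis is None:
        return np.count_nonzero(a)
    # Fallback: a Python-level boolean sum along the requested axis.
    return (np.asarray(a) != 0).sum(axis=axis)

x = np.array([[0, 1, 7],
              [3, 0, 0]])
count_nonzero_axis(x)          # 3
count_nonzero_axis(x, axis=0)  # array([1, 1, 1])
count_nonzero_axis(x, axis=1)  # array([2, 1])
```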

On Sun, May 22, 2016 at 1:35 PM, G Young  wrote:

> 1) Correction: The PR was not written with small arrays in mind.  I ran
> some new timing tests, and it does perform worse on smaller arrays but
> appears to scale better than the current implementation.
>
> 2) Let me put it out there that I am not opposed to moving it to C, but
> right now, there seems to be a large technical brick wall up against such
> an implementation.  So suggestions about how to move the code into C would
> be welcome too!
>
> On Sun, May 22, 2016 at 10:32 AM, Ralf Gommers 
> wrote:
>
>>
>>
>> On Sun, May 22, 2016 at 3:05 AM, G Young  wrote:
>>
>>> Hi,
>>>
>>> I have had a PR  open (first
>>> draft can be found here ) for
>>> quite some time now that adds an 'axis' parameter to *count_nonzero*.
>>> While the functionality is fully in place, very robust, and actually
>>> higher-performing than the original *count_nonzero* function, the
>>> obstacle at this point is the implementation, as most of the functionality
>>> is now surfaced at the Python level instead of at the C level.
>>>
>>> I have made several attempts to move the code into C to no avail and
>>> have not received much feedback from maintainers unfortunately to move this
>>> forward, so I'm opening this up to the mailing list to see what you guys
>>> think of the changes and whether or not it should be merged in as is or be
>>> tabled until a more C-friendly solution can be found.
>>>
>>
>> The discussion is spread over several PRs/issues, so maybe a summary is
>> useful:
>>
>> - adding an axis parameter was a feature request that was generally
>> approved of [1]
>> - writing the axis selection/validation code in C, like the rest of
>> count_nonzero, was preferred by several core devs
>> - Writing that C code turns out to be tricky. Jaime had a PR for doing
>> this for bincount [2], but closed it with final conclusion "the proper
>> approach seems to me to build some intermediate layer over nditer that
>> abstracts the complexity away".
>> - Julian pointed out that this adds a ufunc-like param, so why not add
>> other params like out/keepdims [3]
>> - Stephan points out that the current PR has quite a few branches, would
>> benefit from reusing a helper function (like _validate_axis, but that may
>> not do exactly the right thing), and that he doesn't want to merge it as is
>> without further input from other devs [4].
>>
>> Points previously not raised that I can think of:
>> - count_nonzero is also in the C API [5], the axis parameter is now only
>> added to the Python API.
>> - Part of why the code in this PR is complex is to keep performance for
>> small arrays OK, but there's no benchmarks added or result given for the
>> existing benchmark [6]. A simple check with:
>>   x = np.arange(100)
>>   %timeit np.count_nonzero(x)
>> shows that that gets about 30x slower (330 ns vs 10.5 us on my machine).
>>
>> It looks to me like performance is a concern, and if that can be resolved
>> there's the broader discussion of whether it's a good idea to merge this PR
>> at all. That's a trade-off of adding a useful feature vs. technical debt /
>> maintenance burden plus divergence Python/C API. Also, what do we do when
>> we merge this and then next week someone else sends a PR adding a keepdims
>> or out keyword? For these kinds of additions it would feel better if we
>> were sure that the new version is the final/desired one for the foreseeable
>> future.
>>
>> Ralf
>>
>>
>> [1] https://github.com/numpy/numpy/issues/391
>> [2] https://github.com/numpy/numpy/pull/4330#issuecomment-77791250
>> [3] https://github.com/numpy/numpy/pull/7138#issuecomment-177202894
>> [4] https://github.com/numpy/numpy/pull/7177
>> [5]
>> http://docs.scipy.org/doc/numpy/reference/c-api.array.html#c.PyArray_CountNonzero
>> [6]
>> 

Re: [Numpy-discussion] axis parameter for count_nonzero

2016-05-22 Thread G Young
1) Correction: The PR was not written with small arrays in mind.  I ran
some new timing tests, and it does perform worse on smaller arrays but
appears to scale better than the current implementation.

2) Let me put it out there that I am not opposed to moving it to C, but
right now, there seems to be a large technical brick wall up against such
an implementation.  So suggestions about how to move the code into C would
be welcome too!

On Sun, May 22, 2016 at 10:32 AM, Ralf Gommers 
wrote:

>
>
> On Sun, May 22, 2016 at 3:05 AM, G Young  wrote:
>
>> Hi,
>>
>> I have had a PR  open (first
>> draft can be found here ) for
>> quite some time now that adds an 'axis' parameter to *count_nonzero*.
>> While the functionality is fully in place, very robust, and actually
>> higher-performing than the original *count_nonzero* function, the
>> obstacle at this point is the implementation, as most of the functionality
>> is now surfaced at the Python level instead of at the C level.
>>
>> I have made several attempts to move the code into C to no avail and have
>> not received much feedback from maintainers unfortunately to move this
>> forward, so I'm opening this up to the mailing list to see what you guys
>> think of the changes and whether or not it should be merged in as is or be
>> tabled until a more C-friendly solution can be found.
>>
>
> The discussion is spread over several PRs/issues, so maybe a summary is
> useful:
>
> - adding an axis parameter was a feature request that was generally
> approved of [1]
> - writing the axis selection/validation code in C, like the rest of
> count_nonzero, was preferred by several core devs
> - Writing that C code turns out to be tricky. Jaime had a PR for doing
> this for bincount [2], but closed it with final conclusion "the proper
> approach seems to me to build some intermediate layer over nditer that
> abstracts the complexity away".
> - Julian pointed out that this adds a ufunc-like param, so why not add
> other params like out/keepdims [3]
> - Stephan points out that the current PR has quite a few branches, would
> benefit from reusing a helper function (like _validate_axis, but that may
> not do exactly the right thing), and that he doesn't want to merge it as is
> without further input from other devs [4].
>
> Points previously not raised that I can think of:
> - count_nonzero is also in the C API [5], the axis parameter is now only
> added to the Python API.
> - Part of why the code in this PR is complex is to keep performance for
> small arrays OK, but there's no benchmarks added or result given for the
> existing benchmark [6]. A simple check with:
>   x = np.arange(100)
>   %timeit np.count_nonzero(x)
> shows that that gets about 30x slower (330 ns vs 10.5 us on my machine).
>
> It looks to me like performance is a concern, and if that can be resolved
> there's the broader discussion of whether it's a good idea to merge this PR
> at all. That's a trade-off of adding a useful feature vs. technical debt /
> maintenance burden plus divergence Python/C API. Also, what do we do when
> we merge this and then next week someone else sends a PR adding a keepdims
> or out keyword? For these kinds of additions it would feel better if we
> were sure that the new version is the final/desired one for the foreseeable
> future.
>
> Ralf
>
>
> [1] https://github.com/numpy/numpy/issues/391
> [2] https://github.com/numpy/numpy/pull/4330#issuecomment-77791250
> [3] https://github.com/numpy/numpy/pull/7138#issuecomment-177202894
> [4] https://github.com/numpy/numpy/pull/7177
> [5]
> http://docs.scipy.org/doc/numpy/reference/c-api.array.html#c.PyArray_CountNonzero
> [6]
> https://github.com/numpy/numpy/blob/master/benchmarks/benchmarks/bench_ufunc.py#L70
>


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-22 Thread Robert Kern
On Wed, May 18, 2016 at 7:56 PM, Nathaniel Smith  wrote:
>
> On Wed, May 18, 2016 at 5:07 AM, Robert Kern 
wrote:
> > On Wed, May 18, 2016 at 1:14 AM, Nathaniel Smith  wrote:

> >> ...anyway, the real reason I'm a bit grumpy is because there are solid
> >> engineering reasons why users *want* this API,
> >
> > I remain unconvinced on this mark. Grumpily.
>
> Sorry for getting grumpy :-).

And my apologies for some unwarranted hyperbole. I think we're both
converging on a reasonable approach, though.

> The engineering reasons seem pretty
> obvious to me though?

Well, I mean, engineers want lots of things. I suspect that most engineers
*really* just want to call `numpy.random.seed(8675309)` at the start and
never explicitly pass around separate streams. There's an upside to that in
terms of code simplicity. There are also significant limitations and
constraints. Ultimately, that upside, measured against the alternative of
passing around RandomState objects, is usually outweighed by the
limitations, so best practice is to pass around RandomState objects.
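
A minimal sketch of that practice (seed values here are arbitrary examples):

```python
import numpy as np

# Each component receives its own explicitly seeded stream instead of
# touching the global np.random state.
def simulate(rng, n=3):
    return rng.normal(size=n)

seeds = [12345, 67890]                      # chosen by the end user
streams = [np.random.RandomState(s) for s in seeds]
results = [simulate(rng) for rng in streams]
```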

I acknowledge that there exists an upside to the splitting API, but I don't
think it's a groundbreaking improvement over the alternative current best
practice. It's also unclear to me how often situations that really
demonstrate the upside come into play; in my experience a lot of these
situations are already structured such that preallocating N streams is the
natural thing to do. The limitations and constraints are currently
underexplored, IMO; and in this conservative field, pessimism is warranted.

> If you have any use case for independent streams
> at all, and you're writing code that's intended to live inside a
> library's abstraction barrier, then you need some way to choose your
> streams to avoid colliding with arbitrary other code that the end-user
> might assemble alongside yours as part of their final program. So
> AFAICT you have two options: either you need a "tree-style" API for
> allocating these streams, or else you need to add some explicit API to
> your library that lets the end-user control in detail which streams
> you use. Both are possible, but the latter is obviously undesireable
> if you can avoid it, since it breaks the abstraction barrier, making
> your library more complicated to use and harder to evolve.

ACK

> >> so whether or not it
> >> turns out to be possible I think we should at least be allowed to have
> >> a discussion about whether there's some way to give it to them.
> >
> > I'm not shutting down discussion of the option. I *implemented* the
option.
> > I think that discussing whether it should be part of the main API is
> > premature. There probably ought to be a paper or three out there
supporting
> > its safety and utility first. Let the utility function version flourish
> > first.
>
> OK -- I guess this particularly makes sense given how
> extra-tightly-constrained we currently are in fixing mistakes in
> np.random. But I feel like in the end the right place for this really
> is inside the RandomState interface, because the person implementing
> RandomState is the one best placed to understand (a) the gnarly
> technical details here, and (b) how those change depending on the
> particular PRNG in use. I don't want to end up with a bunch of
> subtly-buggy utility functions in non-specialist libraries like dask
> -- so we should be trying to help downstream users figure out how to
> actually get this into np.random?

I think this is an open research area. An enterprising grad student could
milk this for a couple of papers analyzing how to do this safely for a
variety of PRNGs. I don't think we can hash this out in an email thread or
PR. So yeah, eventually there might be an API on RandomState for this, but
it's way too premature to do so right now, IMO. Maybe start with a
specialized subclass of RandomState that adds this experimental API. In
ng-numpy-randomstate. ;-)

But if someone has spare time to work on numpy.random, for God's sake, use
it to review @gfyoung's PRs instead.

> >> It's
> >> not even 100% out of the question that we conclude that existing PRNGs
> >> are buggy because they don't take this use case into account -- it
> >> would be far from the first time that numpy found itself going beyond
> >> the limits of older numerical tools that weren't designed to build the
> >> kind of large composable systems that numpy gets used for.
> >>
> >> MT19937's state space is large enough that you could explicitly encode
> >> a "tree seed" into it, even if you don't trust the laws of probability
> >> -- e.g., you start with a RandomState with id [], then its children
> >> have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
> >> 1], ..., [1, 0], ..., and you write these into the state (probably
> >> after sending them through some bit-mixing permutation), to guarantee
> >> non-collision.
> >
> > Sure. Not entirely sure if that can be done without 

Re: [Numpy-discussion] axis parameter for count_nonzero

2016-05-22 Thread Ralf Gommers
On Sun, May 22, 2016 at 3:05 AM, G Young  wrote:

> Hi,
>
> I have had a PR  open (first
> draft can be found here ) for
> quite some time now that adds an 'axis' parameter to *count_nonzero*.
> While the functionality is fully in place, very robust, and actually
> higher-performing than the original *count_nonzero* function, the
> obstacle at this point is the implementation, as most of the functionality
> is now surfaced at the Python level instead of at the C level.
>
> I have made several attempts to move the code into C to no avail and have
> not received much feedback from maintainers unfortunately to move this
> forward, so I'm opening this up to the mailing list to see what you guys
> think of the changes and whether or not it should be merged in as is or be
> tabled until a more C-friendly solution can be found.
>

The discussion is spread over several PRs/issues, so maybe a summary is
useful:

- adding an axis parameter was a feature request that was generally
approved of [1]
- writing the axis selection/validation code in C, like the rest of
count_nonzero, was preferred by several core devs
- Writing that C code turns out to be tricky. Jaime had a PR for doing this
for bincount [2], but closed it with final conclusion "the proper approach
seems to me to build some intermediate layer over nditer that abstracts the
complexity away".
- Julian pointed out that this adds a ufunc-like param, so why not add
other params like out/keepdims [3]
- Stephan points out that the current PR has quite a few branches, would
benefit from reusing a helper function (like _validate_axis, but that may
not do exactly the right thing), and that he doesn't want to merge it as is
without further input from other devs [4].

Points previously not raised that I can think of:
- count_nonzero is also in the C API [5], the axis parameter is now only
added to the Python API.
- Part of why the code in this PR is complex is to keep performance for
small arrays OK, but there's no benchmarks added or result given for the
existing benchmark [6]. A simple check with:
  x = np.arange(100)
  %timeit np.count_nonzero(x)
shows that that gets about 30x slower (330 ns vs 10.5 us on my machine).
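
A rough way to reproduce that timing check (absolute numbers will differ
per machine):

```python
import timeit
import numpy as np

# Micro-benchmark: per-call cost of count_nonzero on a small array.
x = np.arange(100)
per_call = timeit.timeit(lambda: np.count_nonzero(x), number=100000) / 100000
print("count_nonzero(100 elements): %.2f us per call" % (per_call * 1e6))
```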

It looks to me like performance is a concern, and if that can be resolved
there's the broader discussion of whether it's a good idea to merge this PR
at all. That's a trade-off of adding a useful feature vs. technical debt /
maintenance burden plus divergence Python/C API. Also, what do we do when
we merge this and then next week someone else sends a PR adding a keepdims
or out keyword? For these kinds of additions it would feel better if we
were sure that the new version is the final/desired one for the foreseeable
future.

Ralf


[1] https://github.com/numpy/numpy/issues/391
[2] https://github.com/numpy/numpy/pull/4330#issuecomment-77791250
[3] https://github.com/numpy/numpy/pull/7138#issuecomment-177202894
[4] https://github.com/numpy/numpy/pull/7177
[5]
http://docs.scipy.org/doc/numpy/reference/c-api.array.html#c.PyArray_CountNonzero
[6]
https://github.com/numpy/numpy/blob/master/benchmarks/benchmarks/bench_ufunc.py#L70