Re: [Numpy-discussion] align `choices` and `sample` with Python `random` module

Warren Weckesser Tue, 11 Dec 2018 10:38:15 -0800

On Tue, Dec 11, 2018 at 10:32 AM Ralf Gommers <ralf.gomm...@gmail.com>
wrote:


>
>
> On Mon, Dec 10, 2018 at 10:27 AM Warren Weckesser <
> warren.weckes...@gmail.com> wrote:
>
>>
>>
>> On 12/10/18, Ralf Gommers <ralf.gomm...@gmail.com> wrote:
>> > On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <alan.is...@gmail.com> wrote:
>> >
>> >> I believe this was proposed in the past to little enthusiasm,
>> >> with the response, "you're using a library; learn its functions".
>> >>
>> >
>> > Not only that, NumPy and the core libraries around it are the standard for
>> > numerical/statistical computing. If core Python devs want to replicate a
>> > small subset of that functionality in a new Python version like 3.6, it
>> > would be sensible for them to choose compatible names. I don't think
>> > there's any justification for us to bother our users based on new things
>> > that get added to the stdlib.
>> >
>> >
>> >> Nevertheless, given the addition of `choices` to the Python
>> >> random module in 3.6, it would be nice to have the *same name*
>> >> for parallel functionality in numpy.random.
>> >>
>> >> And given the redundancy of numpy.random.sample, it would be
>> >> nice to deprecate it with the intent to reintroduce
>> >> the name later, better aligned with Python's usage.
>> >>
>> >
>> > No, there is nothing wrong with the current API, so I'm -10 on deprecating
>> > it.
>>
>> Actually, the `numpy.random.choice` API has one major weakness.  When
>> `replace` is False and `size` is greater than 1, the function is actually
>> drawing *one* sample from a multivariate distribution.  For the other
>> multivariate distributions (multinomial, multivariate_normal and
>> dirichlet), `size` sets the number of samples to draw from the
>> distribution.  With `replace=False` in `choice`, size becomes a *parameter*
>> of the distribution, and it is only possible to draw one (multivariate)
>> sample.
>>
>
> I'm not sure I follow. `choice` draws samples from a given 1-D array, more
> than 1:
>
> In [12]: np.random.choice(np.arange(5), size=2, replace=True)
> Out[12]: array([2, 2])
>
> In [13]: np.random.choice(np.arange(5), size=2, replace=False)
> Out[13]: array([3, 0])
>
> The multivariate distribution you're talking about is for generating the
> indices I assume. Does the current implementation actually give a result
> for size>1 that has different statistical properties from calling the
> function N times with size=1? If so, that's definitely worth a bug report
> at least (I don't think there is one for this).
>
>
There is no bug, just a limitation in the API.

When I draw without replacement, say, three values from a collection of
length five, the three values that I get are not independent.  So really,
this is *one* sample from a three-dimensional (discrete-valued)
distribution.  The problem with the current API is that I can't get
multiple samples from this three-dimensional distribution in one call.  If
I need to repeat the process six times, I have to use a loop, e.g.:

    >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False, size=3)
    ...            for _ in range(6)]
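
The loop above can be written as a small self-contained script. This is only a sketch using the existing `numpy.random.choice`; the seed and the stacking into a single array are illustrative assumptions, added to make the resulting shape explicit.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the run is repeatable

items = [10, 20, 30, 40, 50]

# One call per sample: each call draws 3 distinct items from `items`.
# Six calls are needed because `size` is consumed by the per-sample length.
samples = np.array([np.random.choice(items, size=3, replace=False)
                    for _ in range(6)])

print(samples.shape)
```

Each row is one draw without replacement, so values never repeat within a row, while the rows themselves are independent of one another.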

With the `select` function I described in my previous email, which I'll
call `random_select` here, the parameter that determines the number of
items per sample, `nsample`, is separate from the parameter that determines
the number of samples, `size`:

    >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
    >>> samples
    array([[30, 40, 50],
           [40, 50, 30],
           [10, 20, 40],
           [20, 30, 50],
           [40, 20, 50],
           [20, 10, 30]])
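
A minimal sketch of how such a function could behave in the equal-probability case is shown here. It is not the prototype from the gist: the body, the `rng` parameter, and the use of `numpy.random.default_rng` (available in NumPy 1.17 and later) are assumptions made for illustration, and the `p` argument is not handled.

```python
import numpy as np

def random_select(items, nsample, size, rng=None):
    """Hypothetical sketch: draw `size` samples, each of `nsample`
    distinct items, with all items equally likely (no `p` support)."""
    rng = np.random.default_rng() if rng is None else rng
    items = np.asarray(items)
    if not 0 <= nsample <= len(items):
        raise ValueError("nsample must satisfy 0 <= nsample <= len(items)")
    # Sorting i.i.d. uniform keys along each row yields an independent
    # random permutation of the indices for every row.
    keys = rng.random((size, len(items)))
    idx = np.argsort(keys, axis=1)[:, :nsample]
    # Fancy indexing with a (size, nsample) index array returns that shape.
    return items[idx]

samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6,
                        rng=np.random.default_rng(12345))
print(samples.shape)
```

Keeping the first `nsample` columns of each permutation gives one draw without replacement per row, so `nsample` and `size` stay separate as in the call above.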

(`select` is a really bad name, since `numpy.select` already exists and is
something completely different.  I had the longer name `random.select` in
mind when I started using it. "There are only two hard problems..." etc.)

Warren



> Cheers,
> Ralf
>
>
>
>> I thought about this some time ago, and came up with an API that
>> eliminates the boolean flag, and separates the `size` argument from the
>> number of items drawn in one sample, which I'll call `nsample`. To avoid
>> creating a "false friend" with the standard library and with numpy's
>> `choice`, I'll call this function `select`.
>>
>> Here's the proposed signature and docstring.  (A prototype implementation
>> is in a gist at
>> https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)
>> The key feature is the `nsample` argument, which sets how many items to
>> select from the given collection.  Also note that this function is *always*
>> drawing *without replacement*.  It covers the `replace=True` case because
>> drawing one item without replacement is the same as drawing one item with
>> replacement.
>>
>> Whether or not an API like the following is used, it would be nice if
>> there was some way to get multiple samples in the `replace=False` case in
>> one function call.
>>
>> def select(items, nsample=None, p=None, size=None):
>>     """
>>     Select random samples from `items`.
>>
>>     The function randomly selects `nsample` items from `items` without
>>     replacement.
>>
>>     Parameters
>>     ----------
>>     items : sequence
>>         The collection of items from which the selection is made.
>>     nsample : int, optional
>>         Number of items to select without replacement in each draw.
>>         It must be between 0 and len(items), inclusive.
>>     p : array-like of floats, same length as `items`, optional
>>         Probabilities of the items.  If this argument is not given,
>>         the elements in `items` are assumed to have equal probability.
>>     size : int, optional
>>         Number of variates to draw.
>>
>>     Notes
>>     -----
>>     `size=None` means "generate a single selection".
>>
>>     If `size` is None, the result is equivalent to
>>         numpy.random.choice(items, size=nsample, replace=False)
>>
>>     `nsample=None` means draw one (scalar) sample.
>>     If `nsample` is None, the function acts (almost) like nsample=1 (see
>>     below for more information), and the result is equivalent to
>>         numpy.random.choice(items, size=size)
>>     In effect, it does choice with replacement.  The case `nsample=None`
>>     can be interpreted as each sample is a scalar, and `nsample=k`
>>     means each sample is a sequence with length k.
>>
>>     If `nsample` is not None, it must be a nonnegative integer with
>>     0 <= nsample <= len(items).
>>
>>     If `size` is not None, it must be an integer or a tuple of integers.
>>     When `size` is an integer, it is treated as the tuple ``(size,)``.
>>
>>     When both `nsample` and `size` are not None, the result
>>     has shape ``size + (nsample,)``.
>>
>>     Examples
>>     --------
>>     Make 6 choices with replacement from [10, 20, 30, 40].  (This is
>>     equivalent to "Make 1 choice without replacement from [10, 20, 30, 40];
>>     do it six times.")
>>
>>     >>> select([10, 20, 30, 40], size=6)
>>     array([20, 20, 40, 10, 40, 30])
>>
>>     Choose two items from [10, 20, 30, 40] without replacement.  Do it six
>>     times.
>>
>>     >>> select([10, 20, 30, 40], nsample=2, size=6)
>>     array([[40, 10],
>>            [20, 30],
>>            [10, 40],
>>            [30, 10],
>>            [10, 30],
>>            [10, 20]])
>>
>>     When `nsample` is an integer, there is always an axis at the end of the
>>     result with length `nsample`, even when `nsample=1`.  For example, the
>>     shape of the array returned in the following call is (2, 3, 1).
>>
>>     >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
>>     array([[[10],
>>             [30],
>>             [20]],
>>
>>            [[10],
>>             [40],
>>             [20]]])
>>
>>     When `nsample` is None, it acts like `nsample=1`, but the trivial
>>     dimension is not included.  The shape of the array returned in the
>>     following call is (2, 3).
>>
>>     >>> select([10, 20, 30, 40], size=(2, 3))
>>     array([[20, 40, 30],
>>            [30, 20, 40]])
>>
>>     """
>>
>>
>> Warren
>>
>> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
