On Tue, Dec 11, 2018 at 1:37 PM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>
> On Tue, Dec 11, 2018 at 10:32 AM Ralf Gommers <ralf.gomm...@gmail.com> wrote:
>
>> On Mon, Dec 10, 2018 at 10:27 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>
>>> On 12/10/18, Ralf Gommers <ralf.gomm...@gmail.com> wrote:
>>> > On Sun, Dec 9, 2018 at 2:00 PM Alan Isaac <alan.is...@gmail.com> wrote:
>>> >
>>> >> I believe this was proposed in the past to little enthusiasm,
>>> >> with the response, "you're using a library; learn its functions".
>>> >>
>>> >
>>> > Not only that, NumPy and the core libraries around it are the standard for
>>> > numerical/statistical computing. If core Python devs want to replicate a
>>> > small subset of that functionality in a new Python version like 3.6, it
>>> > would be sensible for them to choose compatible names. I don't think
>>> > there's any justification for us to bother our users based on new things
>>> > that get added to the stdlib.
>>> >
>>> >> Nevertheless, given the addition of `choices` to the Python
>>> >> random module in 3.6, it would be nice to have the *same name*
>>> >> for parallel functionality in numpy.random.
>>> >>
>>> >> And given the redundancy of numpy.random.sample, it would be
>>> >> nice to deprecate it with the intent to reintroduce
>>> >> the name later, better aligned with Python's usage.
>>> >>
>>> >
>>> > No, there is nothing wrong with the current API, so I'm -10 on deprecating
>>> > it.
>>>
>>> Actually, the `numpy.random.choice` API has one major weakness. When
>>> `replace` is False and `size` is greater than 1, the function is actually
>>> drawing *one* sample from a multivariate distribution. For the other
>>> multivariate distributions (multinomial, multivariate_normal and
>>> dirichlet), `size` sets the number of samples to draw from the
>>> distribution. With `replace=False` in `choice`, `size` becomes a *parameter*
>>> of the distribution, and it is only possible to draw one (multivariate)
>>> sample.
>>>
>>
>> I'm not sure I follow. `choice` draws samples from a given 1-D array,
>> more than 1:
>>
>> In [12]: np.random.choice(np.arange(5), size=2, replace=True)
>> Out[12]: array([2, 2])
>>
>> In [13]: np.random.choice(np.arange(5), size=2, replace=False)
>> Out[13]: array([3, 0])
>>
>> The multivariate distribution you're talking about is for generating the
>> indices I assume. Does the current implementation actually give a result
>> for size>1 that has different statistical properties from calling the
>> function N times with size=1? If so, that's definitely worth a bug report
>> at least (I don't think there is one for this).
>>
>
> There is no bug, just a limitation in the API.
>
> When I draw without replacement, say, three values from a collection of
> length five, the three values that I get are not independent. So really,
> this is *one* sample from a three-dimensional (discrete-valued)
> distribution. The problem with the current API is that I can't get
> multiple samples from this three-dimensional distribution in one call. If
> I need to repeat the process six times, I have to use a loop, e.g.:
>
>     >>> samples = [np.random.choice([10, 20, 30, 40, 50], replace=False, size=3) for _ in range(6)]
>
> With the `select` function I described in my previous email, which I'll
> call `random_select` here, the parameter that determines the number of
> items per sample, `nsample`, is separate from the parameter that determines
> the number of samples, `size`:
>
>     >>> samples = random_select([10, 20, 30, 40, 50], nsample=3, size=6)
>     >>> samples
>     array([[30, 40, 50],
>            [40, 50, 30],
>            [10, 20, 40],
>            [20, 30, 50],
>            [40, 20, 50],
>            [20, 10, 30]])
>
> (`select` is a really bad name, since `numpy.select` already exists and is
> something completely different. I had the longer name `random.select` in
> mind when I started using it. "There are only two hard problems..." etc.)
>

As I reread this, I see another naming problem: "sample" is used to mean
different things. In my description above, I referred to one "sample" as the
length-3 sequence generated by one call to
`numpy.random.choice([10, 20, 30, 40, 50], replace=False, size=3)`, but in
`random_select`, `nsample` refers to the length of each sequence generated. I
use the name `nsample` to be consistent with `numpy.random.hypergeometric`. I
hope the output of the `random_select` call shown above makes clear the
desired behavior.

Warren

>
>> Cheers,
>> Ralf
>>
>>> I thought about this some time ago, and came up with an API that
>>> eliminates the boolean flag, and separates the `size` argument from the
>>> number of items drawn in one sample, which I'll call `nsample`. To avoid
>>> creating a "false friend" with the standard library and with numpy's
>>> `choice`, I'll call this function `select`.
>>>
>>> Here's the proposed signature and docstring. (A prototype
>>> implementation is in a gist at
>>> https://gist.github.com/WarrenWeckesser/2e5905d116e710914af383ee47adc2bf.)
>>> The key feature is the `nsample` argument, which sets how many items to
>>> select from the given collection. Also note that this function is *always*
>>> drawing *without replacement*. It covers the `replace=True` case because
>>> drawing one item without replacement is the same as drawing one item with
>>> replacement.
>>>
>>> Whether or not an API like the following is used, it would be nice if
>>> there was some way to get multiple samples in the `replace=False` case in
>>> one function call.
>>>
>>> def select(items, nsample=None, p=None, size=None):
>>>     """
>>>     Select random samples from `items`.
>>>
>>>     The function randomly selects `nsample` items from `items` without
>>>     replacement.
>>>
>>>     Parameters
>>>     ----------
>>>     items : sequence
>>>         The collection of items from which the selection is made.
>>>     nsample : int, optional
>>>         Number of items to select without replacement in each draw.
>>>         It must be between 0 and len(items), inclusive.
>>>     p : array-like of floats, same length as `items`, optional
>>>         Probabilities of the items. If this argument is not given,
>>>         the elements in `items` are assumed to have equal probability.
>>>     size : int or tuple of ints, optional
>>>         Number of variates to draw.
>>>
>>>     Notes
>>>     -----
>>>     `size=None` means "generate a single selection".
>>>
>>>     If `size` is None, the result is equivalent to
>>>         numpy.random.choice(items, size=nsample, replace=False)
>>>
>>>     `nsample=None` means draw one (scalar) sample.
>>>     If `nsample` is None, the function acts (almost) like nsample=1 (see
>>>     below for more information), and the result is equivalent to
>>>         numpy.random.choice(items, size=size)
>>>     In effect, it does choice with replacement. The case `nsample=None`
>>>     can be interpreted as each sample is a scalar, and `nsample=k`
>>>     means each sample is a sequence with length k.
>>>
>>>     If `nsample` is not None, it must be a nonnegative integer with
>>>     0 <= nsample <= len(items).
>>>
>>>     If `size` is not None, it must be an integer or a tuple of integers.
>>>     When `size` is an integer, it is treated as the tuple ``(size,)``.
>>>
>>>     When both `nsample` and `size` are not None, the result
>>>     has shape ``size + (nsample,)``.
>>>
>>>     Examples
>>>     --------
>>>     Make 6 choices with replacement from [10, 20, 30, 40]. (This is
>>>     equivalent to "Make 1 choice without replacement from [10, 20, 30, 40];
>>>     do it six times.")
>>>
>>>     >>> select([10, 20, 30, 40], size=6)
>>>     array([20, 20, 40, 10, 40, 30])
>>>
>>>     Choose two items from [10, 20, 30, 40] without replacement. Do it six
>>>     times.
>>>
>>>     >>> select([10, 20, 30, 40], nsample=2, size=6)
>>>     array([[40, 10],
>>>            [20, 30],
>>>            [10, 40],
>>>            [30, 10],
>>>            [10, 30],
>>>            [10, 20]])
>>>
>>>     When `nsample` is an integer, there is always an axis at the end of the
>>>     result with length `nsample`, even when `nsample=1`. For example, the
>>>     shape of the array returned in the following call is (2, 3, 1)
>>>
>>>     >>> select([10, 20, 30, 40], nsample=1, size=(2, 3))
>>>     array([[[10],
>>>             [30],
>>>             [20]],
>>>
>>>            [[10],
>>>             [40],
>>>             [20]]])
>>>
>>>     When `nsample` is None, it acts like `nsample=1`, but the trivial
>>>     dimension is not included. The shape of the array returned in the
>>>     following call is (2, 3).
>>>
>>>     >>> select([10, 20, 30, 40], size=(2, 3))
>>>     array([[20, 40, 30],
>>>            [30, 20, 40]])
>>>
>>>     """
>>>
>>>
>>> Warren
>>>
>>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>
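
For anyone who wants to experiment with the behavior described in the
docstring above, here is a minimal sketch. It is not the prototype from the
gist. It assumes NumPy 1.17 or later (for `numpy.random.default_rng`), uses
the working name `random_select` from this thread, and adds a hypothetical
`rng` argument that is not part of the proposed signature. It simply loops
over `Generator.choice`, so it illustrates the result shapes rather than an
efficient implementation.

```python
import math

import numpy as np


def random_select(items, nsample=None, p=None, size=None, rng=None):
    """Sketch of the proposed `select`; `rng` is a hypothetical extra argument."""
    rng = np.random.default_rng() if rng is None else rng
    items = np.asarray(items)  # assumed to be one-dimensional

    if nsample is None:
        # One item per draw, so this is plain choice with replacement.
        return rng.choice(items, size=size, p=p)

    if not 0 <= nsample <= len(items):
        raise ValueError("nsample must satisfy 0 <= nsample <= len(items)")

    # Normalize `size` to a tuple; None means a single selection.
    if size is None:
        shape = ()
    elif np.isscalar(size):
        shape = (int(size),)
    else:
        shape = tuple(size)

    # Draw each selection independently, then restore the requested shape.
    ndraws = math.prod(shape)
    out = np.empty((ndraws, nsample), dtype=items.dtype)
    for i in range(ndraws):
        out[i] = rng.choice(items, size=nsample, replace=False, p=p)
    return out.reshape(shape + (nsample,))
```

With this sketch, `random_select([10, 20, 30, 40, 50], nsample=3, size=6)`
returns an array of shape (6, 3) in which each row holds three distinct
items, and `random_select([10, 20, 30, 40], nsample=1, size=(2, 3))` has
shape (2, 3, 1), matching the shape rule ``size + (nsample,)``.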