This seems like the catch-all if you're unsure. In general, the purely
technical discussion stays with the PR.

On Fri, Dec 24, 2021, 20:06 Jonathan Crall <erote...@gmail.com> wrote:

> Yes, #9211 <https://github.com/numpy/numpy/pull/9211> is the open PR for
> weighted quantiles. Is this something I should make an issue for on the
> numpy github? Or is the correct place to discuss it on this mailing list?
> I'd like to link to this conversation in two other places on github, but
> that's difficult when discussion is on the mailing list. But if it's more
> appropriate to talk here, let me know.
>
> On Thu, Dec 23, 2021 at 2:29 PM Joseph Fox-Rabinovitz <
> jfoxrabinov...@gmail.com> wrote:
>
>> For what it's worth, I looked into this a long time ago. The missing
>> ingredient has always been weighted quantiles. If I'm not mistaken, the
>> interface already exists, but raises an error. I've had it on my back
>> burner to provide an O(n) C implementation of weighted introselect, but
>> never quite got around to it. I think there has been work to add an
>> O(n log n) implementation recently.
>>
>> - Joe
>>
>> On Thu, Dec 23, 2021 at 1:19 PM Jonathan Crall <erote...@gmail.com>
>> wrote:
>>
>>> While it does feel like this might be more scipy-ish than numpy-ish,
>>> numpy has an existing histogram method, with existing heuristics for
>>> choosing a number of bins automatically, with existing support for weights.
>>> What it lacks is support for using weights and a heuristic jointly. This
>>> proposal is not a massive new feature for numpy. It is just plugging a hole
>>> that exists in the cross product of possible argument combinations for
>>> np.histogram.
>>>
>>> Thank you for the pointer about interpretation of weights. That was
>>> something I felt was going to be a nuance of this, but I didn't have the
>>> words to describe it.
>>>
>>> Within pure numpy, I think it should be possible to compute multiple
>>> histograms and then aggregate them. That seems to lend itself towards
>>> frequency weights, but it seems to me that probability weights would use
>>> the same procedure to estimate bandwidth.
>>>
>>>
>>> https://stats.stackexchange.com/questions/354689/are-frequency-weights-and-sampling-weights-in-practice-the-same-thing
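
As a concrete illustration of the frequency-weight reading (my own sketch,
not part of the proposal): summing per-chunk histograms computed over shared
bin edges is the same as one histogram call over the concatenated data with
unit frequency weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.random(100)
x2 = rng.random(50)

# Shared bin edges for both chunks.
bins = np.linspace(0.0, 1.0, 11)

# Aggregating per-chunk histograms...
h1, _ = np.histogram(x1, bins=bins)
h2, _ = np.histogram(x2, bins=bins)
combined = h1 + h2

# ...matches a single histogram with unit frequency weights.
x = np.concatenate([x1, x2])
h, _ = np.histogram(x, bins=bins, weights=np.ones_like(x))

assert np.array_equal(combined, h)
```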
>>>
>>> And ultimately, this is just an estimator used as a convenience for
>>> programmers. Most real applications will need to define their bins wrt
>>> their problem, but I think if it makes sense for numpy to provide a
>>> heuristic baseline for un-weighted data, then it is natural to assume it
>>> would do so for weighted data as well.
>>>
>>>
>>>
>>>
>>> On Mon, Dec 13, 2021 at 4:03 AM Kevin Sheppard <
>>> kevin.k.shepp...@gmail.com> wrote:
>>>
>>>> To me, this feels like it might be a better fit for SciPy or possibly
>>>> statsmodels (but maybe not, since neither has histogram functions
>>>> anymore). The challenge with weighted estimators is how the weights
>>>> should be interpreted. Stata covers the most important cases of weights:
>>>> https://www.reed.edu/psychology/stata/gs/tutorials/weights.html.
>>>> Would these be frequency weights? Stata supports only frequency weights:
>>>> https://www.stata.com/manuals/u11.pdf#u11.1.6weight.
>>>>
>>>> Kevin
>>>>
>>>>
>>>> On Sun, Dec 12, 2021 at 9:45 AM Jonathan Crall <erote...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all, this is my first post on this mailing list.
>>>>>
>>>>> I'm writing to propose a method for extending the histogram bandwidth
>>>>> estimators to work with weighted data. I originally submitted this 
>>>>> proposal
>>>>> to seaborn: https://github.com/mwaskom/seaborn/issues/2710 and
>>>>> mwaskom suggested I take it here.
>>>>>
>>>>> Currently the unweighted auto heuristic is a combination of the
>>>>> Freedman-Diaconis and Sturges estimators. For reference, these rules
>>>>> are as follows:
>>>>>
>>>>> Sturges: take the peak-to-peak range ptp (i.e. x.max() - x.min()) and
>>>>> the number of data points total=x.size, then divide ptp by one plus
>>>>> the log2 of the number of data points:
>>>>>
>>>>> ptp / (log2(total) + 1)
>>>>>
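
For reference, NumPy's Sturges rule returns the bin width
ptp / (log2(n) + 1); a minimal sketch (function and variable names are
mine):

```python
import numpy as np

def sturges_bin_width(x):
    """Sturges bin width: peak-to-peak range divided by (log2(n) + 1)."""
    x = np.asarray(x, dtype=float)
    ptp = x.max() - x.min()
    total = x.size
    return ptp / (np.log2(total) + 1.0)

rng = np.random.default_rng(0)
x = rng.random(1000)
width = sturges_bin_width(x)
```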
>>>>> Freedman-Diaconis: find the interquartile range of the data,
>>>>> iqr=(np.subtract(*np.percentile(x, [75, 25]))), and the number of data
>>>>> points total=x.size, then apply the formula:
>>>>>
>>>>> 2.0 * iqr * total ** (-1.0 / 3.0)
>>>>>
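
A matching sketch of the Freedman-Diaconis width (again, names are mine):

```python
import numpy as np

def fd_bin_width(x):
    """Freedman-Diaconis bin width: 2 * IQR * n ** (-1/3)."""
    x = np.asarray(x, dtype=float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    total = x.size
    return 2.0 * iqr * total ** (-1.0 / 3.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
width = fd_bin_width(x)
```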
>>>>> Taking a look at these it seems (please correct me if I'm missing
>>>>> something that makes this not work) that there is a simple extension to
>>>>> weighted data. If we can find a weighted replacement for ptp, total, and
>>>>> iqr, the formulas should work exactly the same in the weighted case.
>>>>>
>>>>> The ptp case seems easy. Even if the data points are weighted, that
>>>>> doesn't change the min and max. Nothing changes here.
>>>>>
>>>>> For total, instead of taking the size of the array (which implicitly
>>>>> assumes each data point has a weight of 1), just sum the weight to get
>>>>> total=weights.sum().
>>>>>
>>>>> I believe the IQR is also computable in the weighted case.
>>>>>
>>>>> import numpy as np
>>>>>
>>>>> n = 10
>>>>> rng = np.random.RandomState(12554)
>>>>>
>>>>> x = rng.rand(n)
>>>>> w = rng.rand(n)
>>>>>
>>>>> sorted_idxs = x.argsort()
>>>>> x_sort = x[sorted_idxs]
>>>>> w_sort = w[sorted_idxs]
>>>>>
>>>>> # Normalized cumulative weights give the quantile of each sorted point.
>>>>> cumtotal = w_sort.cumsum()
>>>>> quantiles = cumtotal / cumtotal[-1]
>>>>> idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
>>>>> iqr_weighted = x_sort[idx2] - x_sort[idx1]
>>>>> print('iqr_weighted = {!r}'.format(iqr_weighted))
>>>>>
>>>>> # Test that this is roughly the same for the "unweighted" case
>>>>> # (won't be exactly the same because this method does not interpolate).
>>>>> w = np.ones_like(x)
>>>>> w_sort = w[sorted_idxs]
>>>>> cumtotal = w_sort.cumsum()
>>>>> quantiles = cumtotal / cumtotal[-1]
>>>>> idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
>>>>> iqr_unweighted_repo = x_sort[idx2] - x_sort[idx1]
>>>>> print('iqr_unweighted_repo = {!r}'.format(iqr_unweighted_repo))
>>>>>
>>>>> iqr_unweighted_orig = np.subtract(*np.percentile(x, [75, 25]))
>>>>> print('iqr_unweighted_orig = {!r}'.format(iqr_unweighted_orig))
>>>>>
>>>>>
>>>>> This quick and dirty method of weighted quantiles gives a close result
>>>>> (which is probably fine for a bandwidth estimator):
>>>>>
>>>>> iqr_weighted = 0.21964093625695036
>>>>> iqr_unweighted_repo = 0.36649977003903755
>>>>> iqr_unweighted_orig = 0.30888312408540963
>>>>>
>>>>> And I do see there is an open issue / PR for
>>>>> weighted quantiles/percentiles:
>>>>> https://github.com/numpy/numpy/issues/8935
>>>>> https://github.com/numpy/numpy/pull/9211 so this code could make use
>>>>> of that after it lands.
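
Until then, for anyone who wants interpolation now, here is a hypothetical
weighted IQR using np.interp on the normalized cumulative weights (a sketch
of mine, not the PR's API; the midpoint convention below is just one of
several reasonable choices):

```python
import numpy as np

def weighted_iqr(x, w):
    """IQR from weighted quantiles with linear interpolation (a sketch)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = x.argsort()
    x_sort = x[order]
    w_sort = w[order]
    # Midpoint convention: each point sits at the center of its weight mass.
    cum = np.cumsum(w_sort) - 0.5 * w_sort
    quantiles = cum / w_sort.sum()
    q25, q75 = np.interp([0.25, 0.75], quantiles, x_sort)
    return q75 - q25

rng = np.random.default_rng(0)
x = rng.random(1000)

# With unit weights this should be close to the unweighted IQR.
iqr_w = weighted_iqr(x, np.ones_like(x))
iqr_u = np.subtract(*np.percentile(x, [75, 25]))
```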
>>>>>
>>>>> Lastly, I think the most common case (or at least my case) for using a
>>>>> weighted histogram is to combine multiple histograms. In this case the
>>>>> number of estimated bins might be greater than the number of weighted data
>>>>> points, and a simple min condition on that number and the estimated number
>>>>> of bins should take care of that.
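
Putting the pieces together, the weighted variant of the auto heuristic
might look something like this (entirely a sketch of mine; the weighted IQR
is the quick searchsorted version, and the final min clamp is the condition
described above; degenerate inputs such as ptp == 0 are not handled):

```python
import numpy as np

def weighted_auto_bins(x, w):
    """Sketch of an 'auto' bin-count estimate for weighted data.

    Replaces x.size with weights.sum() and the plain IQR with a
    weighted IQR; ptp is unchanged by weighting.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)

    ptp = x.max() - x.min()
    total = w.sum()

    # Weighted IQR via searchsorted on normalized cumulative weights.
    order = x.argsort()
    x_sort, w_sort = x[order], w[order]
    quantiles = np.cumsum(w_sort) / w_sort.sum()
    idx75, idx25 = np.searchsorted(quantiles, [0.75, 0.25])
    iqr = x_sort[idx75] - x_sort[idx25]

    sturges_width = ptp / (np.log2(total) + 1.0)
    fd_width = 2.0 * iqr * total ** (-1.0 / 3.0)
    # Smaller width wins (more bins), falling back to Sturges if iqr == 0.
    width = min(sturges_width, fd_width) or sturges_width

    n_bins = int(np.ceil(ptp / width))
    # Never estimate more bins than actual data points.
    return min(n_bins, x.size)

rng = np.random.default_rng(0)
x = rng.random(1000)
w = rng.random(1000)
bins = weighted_auto_bins(x, w)
```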
>>>>>
>>>>> Please let me know your thoughts / opinions / ideas on this topic. I
>>>>> did some searching for related discussion, but I may have missed it, so
>>>>> please point me to it if I did. Also, if the reason this feature does
>>>>> not exist is that there is some theoretical problem with estimating
>>>>> bandwidth for weighted data that I'm unaware of, I'd be interested to
>>>>> learn about that (although I can't see that being the case, because
>>>>> these are just heuristics after all, and I have validated that this
>>>>> works well in my own use cases).
>>>>>
>>>>> --
>>>>> -Dr. Jon Crall (him)
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>>>>> To unsubscribe send an email to numpy-discussion-le...@python.org
>>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>>>> Member address: kevin.k.shepp...@gmail.com
>>>>>
>>>
>>>
>>> --
>>> -Dr. Jon Crall (him)
>
>
> --
> -Dr. Jon Crall (him)