This seems like the catch-all if you're unsure. I'm general, the purely technical discussion stays with the PR.
On Fri, Dec 24, 2021, 20:06 Jonathan Crall <erote...@gmail.com> wrote: > Yes, #9211 <https://github.com/numpy/numpy/pull/9211> is the open PR for > weighted quantiles. Is this something I should make an issue for on the > numpy github? Or is the correct place to discuss it on this mailing list? > I'd like to link to this conversation in two other places on github, but > that's difficult when discussion is on the mailing list. But if it's more > appropriate to talk here, let me know. > > On Thu, Dec 23, 2021 at 2:29 PM Joseph Fox-Rabinovitz < > jfoxrabinov...@gmail.com> wrote: > >> For what it's worth, I've looked into this a long time ago. The missing >> ingredient has always been weighted quantiles. If I'm not mistaken, the >> interface already exists, but raises an error. I've had it on my back >> burner to provide an O(n) C implementation of weighted introselect, but >> never quite got around to it. I think there has been work to add a O(n log >> n) implementation recently. >> >> - Joe >> >> On Thu, Dec 23, 2021 at 1:19 PM Jonathan Crall <erote...@gmail.com> >> wrote: >> >>> While it does feel like this might be more scipy-ish than numpy-ish, >>> numpy has an existing histogram method, with existing heuristics for >>> choosing a number of bins automatically, with existing support for weights. >>> What it is lacking is support for weights and a heuristic jointly. This >>> proposal is not a massive new feature for numpy. It is just plugging a hole >>> that exists in the cross product of possible argument combinations for >>> np.histogram. >>> >>> Thank you for the pointer about interpretation of weights. That was >>> something I felt was going to be a nuance of this, but I didn't have the >>> words to describe it. >>> >>> Within pure numpy, I think it should be possible to compute multiple >>> histograms and then aggregate them. That seems to lend itself towards >>> frequency weights, but it seems to me that probability weights would use >>> the same procedure to estimate bandwidth. >>> >>> >>> https://stats.stackexchange.com/questions/354689/are-frequency-weights-and-sampling-weights-in-practice-the-same-thing >>> >>> And ultimately, this is just an estimator used as a convenience for >>> programmers. Most real applications will need to define their bins wrt >>> their problem, but I think if it makes sense for numpy to provide a >>> heuristic baseline for un-weighted data, then it is natural to assume it >>> would do so for weighted data as well. >>> >>> >>> >>> >>> On Mon, Dec 13, 2021 at 4:03 AM Kevin Sheppard < >>> kevin.k.shepp...@gmail.com> wrote: >>> >>>> To me, this feels like it might be a better fit for SciPy or possibly >>>> statsmodels (but maybe not since neither have histogram functions >>>> anymore).The challenge with weighted estimators is how the weights should >>>> be interpreted. Stata covers the most important cases of weights >>>> https://www.reed.edu/psychology/stata/gs/tutorials/weights.html. >>>> Would these be frequency weights? Stata supports only frequency weights >>>> https://www.stata.com/manuals/u11.pdf#u11.1.6weight. >>>> >>>> Kevin >>>> >>>> >>>> On Sun, Dec 12, 2021 at 9:45 AM Jonathan Crall <erote...@gmail.com> >>>> wrote: >>>> >>>>> Hi all, this is my first post on this mailing list. >>>>> >>>>> I'm writing to propose a method for extending the histogram bandwidth >>>>> estimators to work with weighted data. I originally submitted this >>>>> proposal >>>>> to seaborn: https://github.com/mwaskom/seaborn/issues/2710 and >>>>> mwaskom suggested I take it here. >>>>> >>>>> Currently the unweighted auto heuristic is a combination of >>>>> the Freedman-Diaconis and Sturges estimator. For reference, these rules >>>>> are >>>>> as follows: >>>>> >>>>> Sturges: return the peak-to-peak ptp=(i.e. x.max() - x.min()) and >>>>> number of data points total=x.size. Then divide ptp by the log of one plus >>>>> the number of data points. >>>>> >>>>> ptp / log2(total + 2) >>>>> >>>>> Freedman-Diaconis: Find the interquartile-range of the data >>>>> iqr=(np.subtract(*np.percentile(x, [75, 25]))) and the number of data >>>>> points total=x.size, then apply the formula: >>>>> >>>>> 2.0 * iqr * total ** (-1.0 / 3.0). >>>>> >>>>> Taking a look at these it seems (please correct me if I'm missing >>>>> something that makes this not work) that there is a simple extension to >>>>> weighted data. If we can find a weighted replacement for p2p, total, and >>>>> iqr, the formulas should work exactly the same in the weighted case. >>>>> >>>>> The p2p case seems easy. Even if the data points are weighed, that >>>>> doesn't change the min and max. Nothing changes here. >>>>> >>>>> For total, instead of taking the size of the array (which implicitly >>>>> assumes each data point has a weight of 1), just sum the weight to get >>>>> total=weights.sum(). >>>>> >>>>> I believe the IQR is also computable in the weighted case. >>>>> >>>>> import numpy as np >>>>> n = 10 >>>>> rng = np.random.RandomState(12554) >>>>> >>>>> >>>>> x = rng.rand(n) >>>>> w = rng.rand(n) >>>>> >>>>> >>>>> sorted_idxs = x.argsort() >>>>> x_sort = x[sorted_idxs] >>>>> w_sort = w[sorted_idxs] >>>>> >>>>> >>>>> cumtotal = w_sort.cumsum() >>>>> quantiles = cumtotal / cumtotal[-1] >>>>> idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25]) >>>>> iqr_weighted = x_sort[idx2] - x_sort[idx1] >>>>> print('iqr_weighted = {!r}'.format(iqr_weighted)) >>>>> >>>>> >>>>> # test this is the roughtly the same for the "unweighted case" >>>>> # (wont be exactly the same because this method does not have >>>>> interpolation) >>>>> w = np.ones_like(x) >>>>> >>>>> >>>>> w_sort = w[sorted_idxs] >>>>> cumtotal = w_sort.cumsum() >>>>> quantiles = cumtotal / cumtotal[-1] >>>>> idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25]) >>>>> iqr_weighted = x_sort[idx2] - x_sort[idx1] >>>>> iqr_unweighted_repo = x_sort[idx2] - x_sort[idx1] >>>>> print('iqr_unweighted_repo = {!r}'.format(iqr_unweighted_repo)) >>>>> >>>>> >>>>> iqr_unweighted_orig = np.subtract(*np.percentile(x, [75, 25])) >>>>> print('iqr_unweighted_orig = {!r}'.format(iqr_unweighted_orig)) >>>>> >>>>> >>>>> This quick and dirty method if weighted quantiles give a close result >>>>> (which is probably fine for a bandwidth estimator): >>>>> >>>>> iqr_weighted = 0.21964093625695036 >>>>> iqr_unweighted_repo = 0.36649977003903755 >>>>> iqr_unweighted_orig = 0.30888312408540963 >>>>> >>>>> And I do see there is an open issue / PR for >>>>> weighted quantiles/percentiles: >>>>> https://github.com/numpy/numpy/issues/8935 >>>>> https://github.com/numpy/numpy/pull/9211 so this code could make use >>>>> of that after it lands. >>>>> >>>>> Lastly, I think the most common case (or at least my case) for using a >>>>> weighted histogram is to combine multiple histograms. In this case the >>>>> number of estimated bins might be greater than the number of weighted data >>>>> points, and a simple min condition on that number and the estimated number >>>>> of bins should take care of that. >>>>> >>>>> Please let me know: thoughts / opinions / ideas on this topic. I did >>>>> do some searching for related discussion, but I may have missed it, so >>>>> point me to that if I missed it. Also if the reason this feature does not >>>>> exist is because there is some theoretical problem with estimating >>>>> bandwidth for weighted data that I'm unaware of, I'd be interested to >>>>> learn >>>>> about that (although I can't see that being the case because these are >>>>> just >>>>> heuristics after all, and I have validated that this works well in my own >>>>> use-cases). >>>>> >>>>> -- >>>>> -Dr. Jon Crall (him) >>>>> _______________________________________________ >>>>> NumPy-Discussion mailing list -- numpy-discussion@python.org >>>>> To unsubscribe send an email to numpy-discussion-le...@python.org >>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >>>>> Member address: kevin.k.shepp...@gmail.com >>>>> >>>> _______________________________________________ >>>> NumPy-Discussion mailing list -- numpy-discussion@python.org >>>> To unsubscribe send an email to numpy-discussion-le...@python.org >>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >>>> Member address: erote...@gmail.com >>>> >>> >>> >>> -- >>> -Dr. Jon Crall (him) >>> _______________________________________________ >>> NumPy-Discussion mailing list -- numpy-discussion@python.org >>> To unsubscribe send an email to numpy-discussion-le...@python.org >>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >>> Member address: jfoxrabinov...@gmail.com >>> >> _______________________________________________ >> NumPy-Discussion mailing list -- numpy-discussion@python.org >> To unsubscribe send an email to numpy-discussion-le...@python.org >> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >> Member address: erote...@gmail.com >> > > > -- > -Dr. Jon Crall (him) > _______________________________________________ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: jfoxrabinov...@gmail.com >
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com