> It’s late and I’m probably missing something

The issue is not one of range, as you showed there, but of precision. Here’s the test case you’re missing:

    import numpy as np

    def get_err(u64):
        """Return the absolute error incurred by storing a uint64 in a float64."""
        u64 = np.uint64(u64)
        # Round-trip through float64 and measure what was lost
        return u64 - u64.astype(np.float64).astype(np.uint64)

The problem starts appearing with

    >>> get_err(2**53 + 1)
    1

and only gets worse as the size of the integers increases

    >>> get_err(2**64 - 2*10)
    9223372036854775788

This is a lot bigger than float64 eps (although as a relative error, it's similar).

> Either way, such weights don’t really happen in real code I think.

The counterexample I can think of is someone trying to implement fixed-precision arithmetic with large integers. The intersection of people doing both that and histogramdd is probably very small, but it’s at least plausible.

> Yes, there are cross-links to Python, SciPy and Matplotlib functions in the docs.

Great, that was what I was unsure of. I was worried that linking to upstream projects would be sort of weird, but practicality beats purity for sure here.

Eric

On Fri, 27 Apr 2018 at 22:26 Ralf Gommers <ralf.gomm...@gmail.com> wrote:

> On Wed, Apr 25, 2018 at 11:00 PM, Eric Wieser <wieser.eric+nu...@gmail.com>
> wrote:
>
>> For precision loss of the order of float64 eps, I disagree.
>>
>> I was thinking more about precision loss on the order of 1, for large
>> 64-bit integers that can’t fit in a float64
>>
> It's late and I'm probably missing something, but:
>
> >>> np.iinfo(np.int64).max > np.finfo(np.float64).max
> False
>
> Either way, such weights don't really happen in real code I think.
>
>> Note also that #10864 <https://github.com/numpy/numpy/issues/10864>
>> incurs deliberate precision loss of the order 10**-6 x smallest bin, which
>> is also much larger than eps.
>>
> Yeah that's worse.
>
>> It’s also possible to refer users to scipy.stats.binned_statistic
>>
>> That sounds like a good idea to do irrespective of whether histogramdd
>> has problems - I had no idea those existed.
>> Is there a precedent for referring to more feature-rich scipy functions
>> from the basic numpy ones?
>>
> Yes, there are cross-links to Python, SciPy and Matplotlib functions in
> the docs. This is done with intersphinx
> (https://github.com/numpy/numpy/blob/master/doc/source/conf.py#L215).
> Example cross-link for convolve:
> https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.convolve.html
>
> Ralf
>
>> On Wed, 25 Apr 2018 at 22:51 Ralf Gommers <ralf.gomm...@gmail.com> wrote:
>>
>>> On Wed, Apr 25, 2018 at 10:07 PM, Eric Wieser
>>> <wieser.eric+nu...@gmail.com> wrote:
>>>
>>>> what does that gain over having the user do something like
>>>> result.astype()
>>>>
>>>> It means that the user can use integer weights without worrying about
>>>> losing precision due to an intermediate float representation.
>>>>
>>>> It also means they can use higher precision values (np.longdouble) or
>>>> complex weights.
>>>>
>>> None of that seems particularly important to be honest.
>>>
>>>> you’re emitting warnings for everyone
>>>>
>>>> When there’s a risk of precision loss, that seems like the responsible
>>>> thing to do.
>>>>
>>> For precision loss of the order of float64 eps, I disagree. There will
>>> be many such places in numpy and in other core libraries.
>>>
>>>> Users passing float weights would see no warning, I suppose.
>>>>
>>>> is this really worth a new function
>>>>
>>>> There ought to be a function for computing histograms with integer
>>>> weights that doesn’t lose precision. Either we change the existing
>>>> function to do that, or we make a new function.
>>>>
>>> It's also possible to refer users to
>>> scipy.stats.binned_statistic(_2d/dd), which provides a superset of the
>>> histogram functionality and is internally consistent because the
>>> implementations of 1d/2d call the dd one.
>>>
>>> Ralf
>>>
>>>> A possible compromise: like 1, but only change the dtype of the result
>>>> if a weights argument is passed.
>>>>
>>>> #10864 <https://github.com/numpy/numpy/issues/10864> seems like a
>>>> worrying design flaw too, but I suppose that can be dealt with separately.
>>>>
>>>> Eric
>>>>
>>>> On Wed, 25 Apr 2018 at 21:57 Ralf Gommers <ralf.gomm...@gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Apr 9, 2018 at 10:24 PM, Eric Wieser
>>>>> <wieser.eric+nu...@gmail.com> wrote:
>>>>>
>>>>>> Numpy has three histogram functions - histogram, histogram2d, and
>>>>>> histogramdd.
>>>>>>
>>>>>> histogram is by far the most widely used, and in the absence of
>>>>>> weights and normalization, returns an np.intp count for each bin.
>>>>>>
>>>>>> histogramdd (for which histogram2d is a wrapper) returns np.float64
>>>>>> in all circumstances.
>>>>>>
>>>>>> As a contrived comparison
>>>>>>
>>>>>> >>> x = np.linspace(0, 1)
>>>>>> >>> h, e = np.histogram(x*x, bins=4); h
>>>>>> array([25, 10, 8, 7], dtype=int64)
>>>>>> >>> h, e = np.histogramdd((x*x,), bins=4); h
>>>>>> array([25., 10., 8., 7.])
>>>>>>
>>>>>> https://github.com/numpy/numpy/issues/7845 tracks this inconsistency.
>>>>>>
>>>>>> The fix is now trivial: the question is, will changing the return
>>>>>> type break people’s code?
>>>>>>
>>>>>> Either we should:
>>>>>>
>>>>>> 1. Just change it, and hope no one is broken by it
>>>>>> 2. Add a dtype argument:
>>>>>>    - If dtype=None, behave like np.histogram
>>>>>>    - If dtype is not specified, emit a future warning
>>>>>>      recommending to use dtype=None or dtype=float
>>>>>>    - In future, change the default to None
>>>>>> 3. Create a new better-named function histogram_nd, which can
>>>>>>    also be created without the mistake that is
>>>>>>    https://github.com/numpy/numpy/issues/10864.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>
>>>>> (1) seems like a no-go, taking such risks isn't justified by a minor
>>>>> inconsistency.
>>>>>
>>>>> (2) is still fairly intrusive, you're emitting warnings for everyone
>>>>> and still force people to change their code (and if they don't they may
>>>>> run into a backwards compat break).
>>>>>
>>>>> (3) is the best of these options, however is this really worth a new
>>>>> function? My vote would be "do nothing".
>>>>>
>>>>> Ralf
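
A minimal sketch of the precision point made at the top of this message, under a few assumptions: bins are given as explicit edges, and weighted_counts is a hypothetical helper written only for illustration (it is not a NumPy function and is not how histogramdd is implemented). It accumulates the same integer weights once in uint64 and once in float64, to show that the float64 accumulator drops the +1 past 2**53 while the integer one stays exact.

```python
import numpy as np


def weighted_counts(x, edges, weights, dtype):
    """Sum weights into the bins defined by edges, accumulating in dtype.

    Hypothetical helper for illustration only; not part of NumPy.
    """
    x = np.asarray(x, dtype=np.float64)
    edges = np.asarray(edges, dtype=np.float64)
    weights = np.asarray(weights)
    nbins = len(edges) - 1

    # Bins are half-open [e_i, e_{i+1}), except the last one,
    # which also includes its right edge.
    idx = np.searchsorted(edges, x, side="right") - 1
    idx[x == edges[-1]] = nbins - 1

    # Drop samples that fall outside the outermost edges.
    inside = (idx >= 0) & (idx < nbins)

    out = np.zeros(nbins, dtype=dtype)
    # np.add.at is unbuffered, so repeated bin indices accumulate correctly.
    np.add.at(out, idx[inside], weights[inside].astype(dtype))
    return out


x = [0.1, 0.2, 0.9]
edges = [0.0, 0.5, 1.0]
w = np.array([2**53, 1, 3], dtype=np.uint64)

exact = weighted_counts(x, edges, w, np.uint64)   # integer accumulator
lossy = weighted_counts(x, edges, w, np.float64)  # float64 accumulator

print(exact)
print(lossy)
# The first bin should hold 2**53 + 1; float64 cannot represent that value.
print(int(exact[0]) - int(lossy[0]))
```

np.add.at is used here only because it handles repeated indices; it is slow on large inputs, so this is an illustration of the dtype issue rather than a proposed implementation.
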
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion