Can’t arange and linspace operations with floats be done internally Yes, and they probably should be - they’re done this way as a hack because the api exposed for custom dtypes is here <https://github.com/numpy/numpy/blob/81e15e812574d956fcc304c3982e2b59aa18aafb/numpy/core/include/numpy/ndarraytypes.h#L507-L511>, (example implementation here <https://github.com/numpy/numpy/blob/81e15e812574d956fcc304c3982e2b59aa18aafb/numpy/core/src/multiarray/arraytypes.c.src#L3711-L3721>) - essentially, you give it the first two elements of the array, and ask it to fill in the rest.
On Fri, 9 Feb 2018 at 13:17 Matthew Harrigan <harrigan.matt...@gmail.com> wrote: > I apologize if I'm missing something basic, but why are floats being > accumulated in the first place? Can't arange and linspace operations with > floats be done internally similar to `start + np.arange(num_steps) * > step_size`? I.e. always accumulate (really increment) integers to limit > errors. > > On Fri, Feb 9, 2018 at 3:43 PM, Benjamin Root <ben.v.r...@gmail.com> > wrote: > >> >> >> On Fri, Feb 9, 2018 at 12:19 PM, Chris Barker <chris.bar...@noaa.gov> >> wrote: >> >>> On Wed, Feb 7, 2018 at 12:09 AM, Ralf Gommers <ralf.gomm...@gmail.com> >>> wrote: >>>> >>>> It is partly a plea for some development of numerically accurate >>>>> functions for computing lat/lon grids from a combination of inputs: >>>>> bounds, >>>>> counts, and resolutions. >>>>> >>>> >>> Can you be more specific about what problems you've run into -- I work >>> with lat-lon grids all the time, and have never had a problem. >>> >>> float32 degrees gives you about 1 meter accuracy or better, so I can see >>> how losing a few digits might be an issue, though I would argue that you >>> maybe shouldn't use float32 if you are worried about anything close to 1m >>> accuracy... -- or shift to a relative coordinate system of some sort. >>> >> >> The issue isn't so much the accuracy of the coordinates themselves. I am >> only worried about 1km resolution (which is approximately 0.01 degrees at >> mid-latitudes). My concern is with consistent *construction* of a >> coordinate grid with even spacing. As it stands right now. If I provide a >> corner coordinate, a resolution, and the number of pixels, the result is >> not terrible (indeed, this is the approach used by gdal/rasterio). If I >> have start/end coordinates and the number of pixels, the result is not bad, >> either (use linspace). But, if I have start/end coordinates and a >> resolution, then determining the number of pixels from that is actually >> tricky to get right in the general case, especially with float32 and large >> grids, and especially if the bounding box specified isn't exactly divisible >> by the resolution. >> >> >>> >>> I have been playing around with the decimal package a bit lately, >>>>> >>>> >>> sigh. decimal is so often looked at a solution to a problem it isn't >>> designed for. lat-lon is natively Sexagesimal -- maybe we need that dtype >>> :-) >>> >>> what you get from decimal is variable precision -- maybe a binary >>> variable precision lib is a better answer -- that would be a good thing to >>> have easy access to in numpy, but in this case, if you want better accuracy >>> in a computation that will end up in float32, just use float64. >>> >> >> I am not concerned about computing distances or anything like that, I am >> trying to properly construct my grid. I need consistent results regardless >> of which way the grid is specified (start/end/count, start/res/count, >> start/end/res). I have found that loading up the grid specs (using in a >> config file or command-line) using the Decimal class allows me to exactly >> and consistently represent the grid specification, and gets me most of the >> way there. But the problems with arange() is frustrating, and I have to >> have extra logic to go around that and over to linspace() instead. >> >> >>> >>> and I discovered the concept of "fused multiply-add" operations for >>>>> improved accuracy. I have come to realize that fma operations could be >>>>> used >>>>> to greatly improve the accuracy of linspace() and arange(). >>>>> >>>> >>> arange() is problematic for non-integer use anyway, by its very >>> definition (getting the "end point" correct requires the right step, even >>> without FP error). >>> >>> and would it really help with linspace? it's computing a delta with one >>> division in fp, then multiplying it by an integer (represented in fp -- >>> why? why not keep that an integer till the multiply?). >>> >> >> Sorry, that was a left-over from a previous draft of my email after I >> discovered that linspace's accuracy was on par with fma(). And while >> arange() has inherent problems, it can still be made better than it is now. >> In fact, I haven't investigated this, but I did recently discover some unit >> tests of mine started to fail after a numpy upgrade, and traced it back to >> a reduction in the accuracy of a usage of arange() with float32s. So, >> something got worse at some point, which means we could still get accuracy >> back if we can figure out what changed. >> >> >>> >>> In particular, I have been needing improved results for computing >>>>> latitude/longitude grids, which tend to be done in float32's to save >>>>> memory >>>>> (at least, this is true in data I come across). >>>>> >>>> >>>> If you care about saving memory *and* accuracy, wouldn't it make more >>>> sense to do your computations in float64, and convert to float32 at the >>>> end? >>>> >>> >>> that does seem to be the easy option :-) >>> >> >> Kinda missing the point, isn't it? Isn't that like saying "convert all >> your data to float64s prior to calling np.mean()"? That's ridiculous. >> Instead, we made np.mean() upcast the inner-loop operation, and even allow >> an option to specify what the dtype that should be used for the aggregator. >> >> >>> >>> >>>> Now, to the crux of my problem. It is next to impossible to generate a >>>>> non-trivial numpy array of coordinates, even in double precision, without >>>>> hitting significant numerical errors. >>>>> >>>> >>> I'm confused, the example you posted doesn't have significant errors... >>> >> >> Hmm, "errors" was the wrong word. "Differences between methods" might be >> more along the lines of what I was thinking. Remember, I am looking for >> consistency. >> >> >>> >>> >>>> Which has lead me down the path of using the decimal package (which >>>>> doesn't play very nicely with numpy because of the lack of casting rules >>>>> for it). Consider the following: >>>>> ``` >>>>> $ cat test_fma.py >>>>> from __future__ import print_function >>>>> import numpy as np >>>>> res = np.float32(0.01) >>>>> cnt = 7001 >>>>> x0 = np.float32(-115.0) >>>>> x1 = res * cnt + x0 >>>>> print("res * cnt + x0 = %.16f" % x1) >>>>> x = np.arange(-115.0, -44.99 + (res / 2), 0.01, dtype='float32') >>>>> print("len(arange()): %d arange()[-1]: %16f" % (len(x), x[-1])) >>>>> x = np.linspace(-115.0, -44.99, cnt, dtype='float32') >>>>> print("linspace()[-1]: %.16f" % x[-1]) >>>>> >>>>> $ python test_fma.py >>>>> res * cnt + x0 = -44.9900015648454428 >>>>> len(arange()): 7002 arange()[-1]: -44.975044 >>>>> linspace()[-1]: -44.9900016784667969 >>>>> ``` >>>>> arange just produces silly results (puts out an extra element... >>>>> adding half of the resolution is typically mentioned as a solution on >>>>> mailing lists to get around arange()'s limitations -- I personally don't >>>>> do >>>>> this). >>>>> >>>> >>> The real solution is "don't do that" arange is not the right tool for >>> the job. >>> >> >> Well, it isn't the right tool because as far as I am concerned, it is >> useless for anything but integers. Why not fix it to be more suitable for >> floating point? >> >> >>> >>> Then there is this: >>> >>> res * cnt + x0 = -44.9900015648454428 >>> linspace()[-1]: -44.9900016784667969 >>> >>> that's as good as you are ever going to get with 32 bit floats... >>> >> >> Consistency is the key thing. I am fine with one of those values, so long >> as that value is what happens no matter which way I specify my grid. >> >> >>> >>> Though I just noticed something about your numbers -- there should be a >>> nice even base ten delta if you have 7001 gaps -- but linspace produces N >>> points, not N gaps -- so maybe you want: >>> >>> >>> In [*17*]: l = np.linspace(-115.0, -44.99, 7002) >>> >>> >>> In [*18*]: l[:5] >>> >>> Out[*18*]: array([-115. , -114.99, -114.98, -114.97, -114.96]) >>> >>> >>> In [*19*]: l[-5:] >>> >>> Out[*19*]: array([-45.03, -45.02, -45.01, -45. , -44.99]) >>> >>> >>> or, in float32 -- not as pretty: >>> >>> >>> In [*20*]: l = np.linspace(-115.0, -44.99, 7002, dtype=np.float32) >>> >>> >>> In [*21*]: l[:5] >>> >>> Out[*21*]: >>> >>> array([-115. , -114.98999786, -114.98000336, -114.97000122, >>> >>> -114.95999908], dtype=float32) >>> >>> >>> In [*22*]: l[-5:] >>> >>> Out[*22*]: array([-45.02999878, -45.02000046, -45.00999832, -45. , >>> -44.99000168], dtype=float32) >>> >>> >>> but still as good as you get with float32, and exactly the same result >>> as computing in float64 and converting: >>> >>> >>> >>> In [*25*]: l = np.linspace(-115.0, -44.99, 7002).astype(np.float32) >>> >>> >>> In [*26*]: l[:5] >>> >>> Out[*26*]: >>> >>> array([-115. , -114.98999786, -114.98000336, -114.97000122, >>> >>> -114.95999908], dtype=float32) >>> >>> >>> In [*27*]: l[-5:] >>> >>> Out[*27*]: array([-45.02999878, -45.02000046, -45.00999832, -45. , >>> -44.99000168], dtype=float32) >>> >> >> Argh! I got myself mixed up between specifying pixel corners versus pixel >> centers. rasterio has been messing me up on this. >> >> >>> >>> >>>>> So, does it make any sense to improve arange by utilizing fma() under >>>>> the hood? >>>>> >>>> >>> no -- this is simply not the right use-case for arange() anyway. >>> >> >> arange() has accuracy problems, so why not fix it? >> >> >>> l4 = np.arange(-115, -44.99, 0.01, dtype=np.float32) >> >>> np.median(np.diff(l4)) >> 0.0099945068 >> >>> np.float32(0.01) >> 0.0099999998 >> >> There is something significantly wrong here if arange(), which takes a >> resolution parameter, can't seem to produce a sequence with the proper >> delta. >> >> >> >>> >>> >>>> Also, any plans for making fma() available as a ufunc? >>>>> >>>> >>> could be nice -- especially if used internally. >>> >>> >>>> Notice that most of my examples required knowing the number of grid >>>>> points ahead of time. But what if I didn't know that? What if I just have >>>>> the bounds and the resolution? Then arange() is the natural fit, but as I >>>>> showed, its accuracy is lacking, and you have to do some sort of hack to >>>>> do >>>>> a closed interval. >>>>> >>>> >>> no -- it's not -- if you have the bounds and the resolution, you have an >>> over-specified problem. That is: >>> >>> x_min + (n * delta_x) == x_max >>> >>> If there is ANY error in either delta_x or x_max (or x_min), then you'll >>> get a missmatch. which is why arange is not the answer (you can make the >>> algorithm a bit more accurate, I suppose but there is still fp limited >>> precision -- if you can't exactly represent either delta_x or x_max, then >>> you CAN'T use the arange() definition and expect to work consistently. >>> >>> The "right" way to do it is to compute N with: round((x_max - x_min) / >>> delta), and then use linspace: >>> >>> linspace(x_min, x_max, N+1) >>> >>> (note that it's too bad you need to do N+1 -- if I had to do it over >>> again, I'd use N as the number of "gaps" rather than the number of points >>> -- that's more commonly what people want, if they care at all) >>> >>> This way, you get a grid with the endpoints as exact as they can be, and >>> the deltas as close to each-other as they can be as well. >>> >>> maybe you can do a better algorithm in linspace to save an ULP, but it's >>> hard to imagine when that would matter. >>> >> >> Yes, it is overspecified. My problem is that different tools require >> different specs (ahem... rasterio/gdal), and I have gird specs coming from >> other sources. And I need to produce data onto the same grid so that tools >> like xarray won't yell at me when I am trying to do an operation between >> gridded data that should have the same coordinates, but are off slightly >> because they were computed differently for whatever reason. >> >> I guess I am crying out for some sort of tool that will help the >> community stop making the same mistakes. A one-stop shop that'll allow us >> to specify a grid in a few different ways and still produce the right >> thing, and even do the inverse... provide a coordinate array and get grids >> specs in whatever form we want. Maybe even have options for dealing with >> pixel corner vs. pixel centers, too? There are additional fun problems such >> as padding out coordinate arrays, which np.pad doesn't really do a great >> job with. >> >> Cheers! >> Ben Root >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org > https://mail.python.org/mailman/listinfo/numpy-discussion >
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion