Hi!, El Tue, Jan 05, 2016 at 09:01:40PM -0800, Ganesh Ajjanagadde escribio: > On Tue, Jan 5, 2016 at 10:46 AM, Ganesh Ajjanagadde <gajja...@mit.edu> wrote: > > On Tue, Jan 5, 2016 at 10:10 AM, Daniel Serpell <dserp...@gmail.com> wrote: > >> Hi!, > >> > >> El Tue, Jan 05, 2016 at 08:08:35AM -0800, Ganesh Ajjanagadde escribio: > >>> On Tue, Jan 5, 2016 at 7:44 AM, Daniel Serpell <dserp...@gmail.com> wrote: > >>> > Hi!, > >>> > > >>> > El Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde escribio: > >>> >> This exploits an approach based on the sieve of Eratosthenes, a popular > >>> >> method for generating prime numbers. > >>> >> > >>> >> Tables are identical to previous ones. > >>> >> > >>> >> Tested with FATE with/without --enable-hardcoded-tables. > >>> >> > >>> >> Sample benchmark (Haswell, GNU/Linux+gcc): > >>> >> prev: > >>> >> 7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips > >>> >> 7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips > >>> >> [...] > >>> >> 7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips > >>> >> 7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips > >>> >> > >>> >> new: > >>> >> 2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips > >>> >> 2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips > >>> >> [...] > >>> >> 1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips > >>> >> 1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips > >>> >> > >>> > > >>> > See attached code, function "test1", based on an approximation of: > >>> > > >>> > (i+1)^(1/3) ~= i^(1/3) * ( 1 + 1/(3i) - 1/(9i) + 5/(81i) - .... ) > [...] > > So I looked up Mr. Fog's manuals, and it seems like divides are > considerably slower than multiplies and adds. This made me somewhat > skeptical of your idea, and so I benched it. > Seems like your patch is actually significantly slower on my setup > (gcc 5.3.0, GNU libm): > 3349080 decicycles in cbrt_tableinit, 1 runs, 0 skips > 3466605 decicycles in cbrt_tableinit, 2 runs, 0 skips > [...] > 3425939 decicycles in cbrt_tableinit, 256 runs, 0 skips > 3414528 decicycles in cbrt_tableinit, 512 runs, 0 skips > > (clang 3.7.0): > 3337590 decicycles in cbrt_tableinit, 1 runs, 0 skips > 3276225 decicycles in cbrt_tableinit, 2 runs, 0 skips > [...] > 2871065 decicycles in cbrt_tableinit, 256 runs, 0 skips > 2832137 decicycles in cbrt_tableinit, 512 runs, 0 skips > > The divides seem to be what is killing your approach. > Times (just for the divisions, clang): > 1085430 decicycles in cbrt_tableinit, 1 runs, 0 skips > 1005165 decicycles in cbrt_tableinit, 2 runs, 0 skips > [...] > 863732 decicycles in cbrt_tableinit, 256 runs, 0 skips > 864907 decicycles in cbrt_tableinit, 512 runs, 0 skips > > Another illustration, even if I change the 1.0/(3*i) to simply i to > get rid of the multiplication as well, obviously not possible but > really wishful thinking, you still only get: > (clang) > 2013390 decicycles in cbrt_tableinit, 1 runs, 0 skips > 1950135 decicycles in cbrt_tableinit, 2 runs, 0 skips > [...] > 1668963 decicycles in cbrt_tableinit, 256 runs, 0 skips > 1653179 decicycles in cbrt_tableinit, 512 runs, 0 skips > > To complete my curiousity, > (gcc) > 1485240 decicycles in cbrt_tableinit, 1 runs, 0 skips > 1522115 decicycles in cbrt_tableinit, 2 runs, 0 skips > [...] > 1324325 decicycles in cbrt_tableinit, 256 runs, 0 skips > 1322110 decicycles in cbrt_tableinit, 512 runs, 0 skips > > i.e not a whole lot faster than the benchmark I posted. > > If you feel this can't be right, I can do give a higher level > justification, which shows that for a reasonable libm cbrt, and > standard architectural assumptions, the approach I have is faster. >
Thanks for your tests. In my PC I measured a higher speedup, perhaps you can replace the main loop with: cb = 7; for(i=343; i<8192; i++) { double s, r, t; out[i] = cb * i; // For big values, we use 4 terms: r = (1.0/3.0)/i; // 0 A , 1 M , 1 D t = r; // s = 1.0 + r; // 1 A , 1 M , 1 D r = r * r; // 1 A , 2 M , 1 D s = s - r; // 2 A , 2 M , 1 D r = r * t; // 2 A , 3 M , 1 D s = s + 1.6644 *r; // 3 A , 4 M , 1 D cb = cb * s; // 3 A , 5 M , 1 D } I know that divisions are considerably slower than multiplications, but at least in my PC, the following is slower still (based on newton's method): // "cb" is the cube-root approximation to "1/i²" cb = 1.0/49.0; for(i=343; i<8192; i++) { double s = cb*cb*i; cb = (1.0/3.0) * (4*cb - s*s); s = cb*cb*i; cb = (1.0/3.0) * (4*cb - s*s); out[i] = cb * i * i; } Perhaps in your PC is faster? Thanks, Daniel. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel