David Cournapeau wrote:
> Francesc Alted wrote:
>> Well, it is Andrew who should demonstrate that his measurement is
>> correct, but in principle, 4 cycles/item *should* be feasible when
>> using 8 cores in parallel.
>
> But the 100x speed increase is for one core only, unless I misread the
> table. And I should have mentioned that 400 cycles/item for cos is on a
> Pentium 4, which has dreadful performance (defective L1). On a much
> better Core 2 Duo Extreme something, I get 100 cycles/item (on a 64-bit
> machine, though, and not the same compiler, although I guess the libm
> version is what matters the most here).
>
> And let's not forget that there is the Python wrapping cost: by doing
> everything in C, I got ~200 cycles/cos on the P4, and ~60 cycles/cos on
> the Core 2 Duo (for double), using the rdtsc performance counter. All
> this for 1024 items in the array, so a very optimistic use case
> (everything in L2 cache, if not L1).
>
> This shows that the Python wrapping cost is not so high, making the
> 100x claim a bit doubtful without more details on how speed was
> measured.
I appreciate all the discussion this is creating. I wish I could work on
this more right now, but I have a big paper deadline coming up June 1
that I need to focus on.

Yes, you're reading the table right. I should have been clearer about
what my implementation is doing. It uses SIMD, performing four cosines
at a time where a libm cosine only does one. Also, I don't think libm
transcendentals are known for being fast; I'm also likely gaining
performance by using a well-optimized but less accurate approximation.
In fact, a little more inspection shows my accuracy decreases as the
input values increase; I will probably need to take a performance hit
to fix this.

I went and wrote code to use the libm fcos() routine instead of my cos
code. Performance is equivalent to numpy, plus an overhead:

  inp sizes     1024    10240    102400    1024000    3072000
  numpy       0.7282   9.6278  115.5976   993.5738  3017.3680
  lmcos 1     0.7594   9.7579  116.7135  1039.5783  3156.8371
  lmcos 2     0.5274   5.7885   61.8052   537.8451  1576.2057
  lmcos 4     0.5172   5.1240   40.5018   313.2487   791.9730
  corepy 1    0.0142   0.0880    0.9566     9.6162    28.4972
  corepy 2    0.0342   0.0754    0.6991     6.1647    15.3545
  corepy 4    0.0596   0.0963    0.5671     4.9499    13.8784

The times shown are in milliseconds; the system used is a dual-socket,
dual-core 2 GHz Opteron. I'm testing at the ufunc level, like this:

  import time

  def benchmark(fn, args):
      avgtime = 0
      fn(*args)              # prime the execution once
      for i in xrange(7):
          t1 = time.time()
          fn(*args)
          t2 = time.time()
          tm = t2 - t1
          avgtime += tm
      return avgtime / 7

where fn is a ufunc, e.g. numpy.cos. So I prime the execution once, then
do 7 timings and take the average. I always appreciate suggestions on
better ways to benchmark things.

Andrew

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
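[Editorial note: since Andrew asks for better ways to benchmark, a minimal sketch of one common refinement, using the standard-library timeit module and taking the minimum over several repeats rather than the mean. The rationale: system noise and timer jitter only ever make a run slower, so the minimum is a more stable estimate of the achievable time. The input array here is a hypothetical example, not Andrew's actual test data.]

```python
import timeit
import numpy

def benchmark(fn, args, repeats=7):
    """Time fn(*args); return the minimum time in seconds over repeats.

    The minimum filters out OS scheduling noise, which can only inflate
    a measurement, never shrink it.
    """
    fn(*args)  # warm-up call: prime caches and any lazy initialization
    timer = timeit.Timer(lambda: fn(*args))
    # repeat=repeats independent trials, each timing a single call
    return min(timer.repeat(repeat=repeats, number=1))

# Hypothetical usage: time numpy.cos on a 1024-element array
x = numpy.linspace(0.0, 2.0 * numpy.pi, 1024)
t = benchmark(numpy.cos, (x,))
```

timeit also disables garbage collection during timing by default, removing another source of variance that a hand-rolled time.time() loop is exposed to.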