On Tuesday 26 May 2009 15:14:39, Andrew Friedley wrote:
> David Cournapeau wrote:
> > Francesc Alted wrote:
> >> Well, it is Andrew who should demonstrate that his measurement is
> >> correct, but in principle, 4 cycles/item *should* be feasible when
> >> using 8 cores in parallel.
> >
> > But the 100x speed increase is for one core only, unless I misread the
> > table. And I should have mentioned that 400 cycles/item for cos is on a
> > Pentium 4, which has dreadful performance (defective L1). On a much
> > better Core Duo Extreme something, I get 100 cycles/item (on a 64-bit
> > machine, though, and not the same compiler, although I guess the libm
> > version is what matters the most here).
> >
> > And let's not forget that there is the Python wrapping cost: by doing
> > everything in C, I got ~200 cycles/cos on the PIV, and ~60 cycles/cos
> > on the Core 2 Duo (for double), using the rdtsc performance counter.
> > All this for 1024 items in the array, so a very optimistic use case
> > (everything in L2 cache, if not L1).
> >
> > This shows that the Python wrapping cost is not so high, making the
> > 100x claim a bit doubtful without more details on the way speed was
> > measured.
>
> I appreciate all the discussion this is creating. I wish I could work
> on this more right now; I have a big paper deadline coming up June 1
> that I need to focus on.
>
> Yes, you're reading the table right. I should have been more clear on
> what my implementation is doing. It's using SIMD, so it performs 4
> cosines at a time where a libm cosine does only one. Also, I don't
> think libm transcendentals are known for being fast; I'm also likely
> gaining performance by using a well-optimized but less accurate
> approximation. In fact, a little more inspection shows my accuracy
> decreases as the input values increase; I will probably need to take a
> performance hit to fix this.
>
> I went and wrote code to use the libm fcos() routine instead of my cos
> code. Performance is equivalent to numpy, plus an overhead:
>
> inp sizes    1024    10240    102400   1024000   3072000
> numpy      0.7282   9.6278  115.5976  993.5738 3017.3680
>
> lmcos 1    0.7594   9.7579  116.7135 1039.5783 3156.8371
> lmcos 2    0.5274   5.7885   61.8052  537.8451 1576.2057
> lmcos 4    0.5172   5.1240   40.5018  313.2487  791.9730
>
> corepy 1   0.0142   0.0880    0.9566    9.6162   28.4972
> corepy 2   0.0342   0.0754    0.6991    6.1647   15.3545
> corepy 4   0.0596   0.0963    0.5671    4.9499   13.8784
>
> The times I show are in milliseconds; the system used is a dual-socket,
> dual-core 2 GHz Opteron. I'm testing at the ufunc level, like this:
>
> import time
>
> def benchmark(fn, args):
>     avgtime = 0
>     fn(*args)
>
>     for i in xrange(7):
>         t1 = time.time()
>         fn(*args)
>         t2 = time.time()
>
>         tm = t2 - t1
>         avgtime += tm
>
>     return avgtime / 7
>
> Where fn is a ufunc, i.e. numpy.cos. So I prime the execution once,
> then do 7 timings and take the average. I always appreciate suggestions
> on better ways to benchmark things.
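As a purely illustrative aside, here is a minimal sketch of the kind of accuracy check Andrew describes, assuming error is measured as the maximum absolute deviation from numpy.cos over progressively wider input ranges. The approx_cos below is a truncated Taylor polynomial used only as a stand-in; the actual CorePy SIMD kernel is not shown in the thread.

import numpy as np

def approx_cos(x):
    # Stand-in for a fast but less accurate cosine: an 8th-order Taylor
    # polynomial about zero.  It is accurate near zero and degrades as
    # |x| grows, mirroring the behaviour described above.
    x2 = x * x
    return 1 - x2/2 + x2**2/24 - x2**3/720 + x2**4/40320

for limit in (np.pi, 2*np.pi, 4*np.pi, 8*np.pi):
    x = np.linspace(-limit, limit, 100001)
    err = np.max(np.abs(approx_cos(x) - np.cos(x)))
    print("max abs error on [-%g, %g]: %.3e" % (limit, limit, err))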
No, that seems good enough. But maybe you can present results in
cycles/item. This is a relatively common unit and has the advantage that
it does not depend on the clock frequency of your cores.

--
Francesc Alted
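As a minimal sketch of the cycles/item conversion suggested above, assuming the 2 GHz clock of the Opteron system Andrew mentions (the constant below is that assumption; substitute the real core frequency on other machines):

CLOCK_HZ = 2.0e9  # assumed: the 2 GHz Opteron mentioned above

def cycles_per_item(time_ms, n_items, clock_hz=CLOCK_HZ):
    # Convert an average wall-clock time in milliseconds for one ufunc
    # call into the average number of CPU cycles spent per array element.
    return (time_ms * 1e-3) * clock_hz / n_items

# Example with the single-core numpy row of the table above:
# 0.7282 ms for 1024 cos() evaluations is roughly 1.4e3 cycles/item.
print(cycles_per_item(0.7282, 1024))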