On Tuesday 26 May 2009 03:11:56, David Cournapeau wrote:
> Charles R Harris wrote:
> > On Mon, May 25, 2009 at 4:59 AM, Andrew Friedley
> > <afrie...@indiana.edu> wrote:
> > >
> > > For some reason the list seems to occasionally drop my messages...
> > >
> > > Francesc Alted wrote:
> > > > On Friday 22 May 2009 13:52:46, Andrew Friedley wrote:
> > > > > I'm the student doing the project. I have a blog here, which
> > > > > contains some initial performance numbers for a couple of test
> > > > > ufuncs I did:
> > > > >
> > > > > http://numcorepy.blogspot.com
> > > > >
> > > > > Another alternative we've talked about, and which I may (more
> > > > > and more likely) look into, is composing multiple operations
> > > > > together into a single ufunc. Again, the main idea is that
> > > > > memory accesses can be reduced/eliminated.
> > > >
> > > > IMHO, composing multiple operations together is the most
> > > > promising avenue for leveraging current multicore systems.
> > >
> > > Agreed -- our concern when planning the project was to keep the
> > > scope reasonable so I can complete it in the GSoC timeframe. If I
> > > have time I'll definitely be looking into this over the summer;
> > > if not, later.
> > >
> > > > Another interesting approach is to implement costly operations
> > > > (in terms of CPU resources), namely transcendental functions
> > > > like sin, cos or tan, but also others like sqrt or pow, in a
> > > > parallel way. If, besides, you can combine this with vectorized
> > > > versions of them (using the widespread SSE2 instruction set; see
> > > > [1] for an example), then you would be able to achieve really
> > > > good results for sure (at least Intel did with its VML library ;)
> > > >
> > > > [1] http://gruntthepeon.free.fr/ssemath/
> > >
> > > I've seen that page before. Using another source [1] I came up
> > > with a quick/dirty cos ufunc. Performance is crazy good compared
> > > to NumPy (100x); see the latest post on my blog for a little more
> > > info. I'll look at the source myself when I get time again, but is
> > > NumPy using a Python-based cos function, a C implementation, or
> > > something else? As I wrote in my blog, the performance gain is
> > > almost too good to believe.
> >
> > NumPy uses the C library version. If long double and float versions
> > aren't available, the double version is used with number
> > conversions, but that shouldn't give a factor of 100x. Something
> > else is going on.
>
> I think something is wrong with the measurement method - on my
> machine, computing the cos of an array of doubles takes roughly 400
> cycles/item for arrays of a reasonable size (> 1e3 items). Taking 4
> cycles/item for cos would be very impressive :)
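
[Aside: a minimal sketch of the kind of measurement being debated here:
time np.cos over a large array and convert seconds/item to cycles/item.
The 3 GHz clock rate is an assumed placeholder, not a figure from the
thread; substitute the actual CPU frequency of the machine under test.]

    import time
    import numpy as np

    CPU_HZ = 3.0e9  # assumed clock rate; replace with the real CPU frequency

    def cycles_per_item(n=1000000, repeats=10):
        """Rough cycles/item estimate for np.cos over n doubles."""
        x = np.random.rand(n)
        best = float('inf')
        for _ in range(repeats):  # keep the best of several runs
            t0 = time.time()
            np.cos(x)
            best = min(best, time.time() - t0)
        return best * CPU_HZ / n

    print("np.cos: %.1f cycles/item" % cycles_per_item())

If David's figure is representative, plain NumPy prints something on the
order of 400 cycles/item here, which is why a reported 100x speedup
looks suspicious.
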
Well, it is Andrew who should demonstrate that his measurement is
correct, but in principle 4 cycles/item *should* be feasible when using
8 cores in parallel. In [1] one can see that Intel manages (with its VML
kernel) to compute a cos() in fewer than 23 cycles on a single core.
With 8 cores working in parallel, that would in theory come down to
about 3 cycles/item.

[1] http://www.intel.com/software/products/mkl/data/vml/functions/_performanceall.html

--
Francesc Alted
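
[Aside: a minimal sketch of the parallel evaluation Francesc is
estimating, using plain Python threads. NumPy ufuncs release the GIL in
their inner loops on non-object arrays, so chunking the input across
threads can occupy several cores. The chunking scheme and nthreads=8
(mirroring the 8-core figure above) are illustrative assumptions, not
code from the thread.]

    import threading
    import numpy as np

    def parallel_cos(x, nthreads=8):
        """cos(x) computed in nthreads chunks, one thread per chunk."""
        out = np.empty_like(x)
        bounds = np.linspace(0, len(x), nthreads + 1).astype(int)
        # Each thread runs np.cos on a slice; slices of `out` are views,
        # so results land directly in the shared output array.
        threads = [threading.Thread(target=np.cos,
                                    args=(x[lo:hi], out[lo:hi]))
                   for lo, hi in zip(bounds[:-1], bounds[1:])]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return out

    x = np.random.rand(8000000)
    assert np.allclose(parallel_cos(x), np.cos(x))

Even with perfect scaling this only divides the per-item cost by the
core count (23/8 is roughly 3 cycles/item in Francesc's arithmetic), and
thread start-up overhead means it pays off only for large arrays.
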