On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase <seb.ha...@gmail.com> wrote:
> Wes,
> I think I should have a couple of GPUs. I would be ready for anything ...
> if you think that I could do some easy(!) CUDA programming here,
> maybe you could guide me in the right direction...
> Thanks,
> Sebastian.
>
> On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
>> <matthieu.bruc...@gmail.com> wrote:
>>> Use restrict directly in C99 mode (__restrict does not have exactly the
>>> same semantics).
>>> For a valgrind profile, you can check my blog
>>> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
>>> Basically, if you have a python script, you can run
>>> valgrind --optionsinmyblog python myscript.py
>>> For PAPI, you have to install several packages (the perf module for the
>>> kernel, for instance) and a GUI to analyze the results (in Eclipse, it
>>> should be possible).
>>> Matthieu
>>>
>>> 2011/2/15 Sebastian Haase <seb.ha...@gmail.com>
>>>>
>>>> Thanks Matthieu,
>>>> using __restrict__ with g++ did not change anything. How do I use
>>>> valgrind with C extensions?
>>>> I don't know what a "PAPI profile" is ...?
>>>> -Sebastian
>>>>
>>>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>>>> <matthieu.bruc...@gmail.com> wrote:
>>>>> Hi,
>>>>> My first move would be to add a restrict keyword to dist (i.e. dist is
>>>>> the only pointer to that memory location), and then also declare dist_
>>>>> inside the first loop with restrict.
>>>>> Then I would run valgrind or a PAPI profile on your code to see what
>>>>> causes the issue (false sharing, ...)
>>>>> Matthieu
>>>>>
>>>>> 2011/2/15 Sebastian Haase <seb.ha...@gmail.com>
>>>>>>
>>>>>> Hi,
>>>>>> I assume that someone here could maybe help me, and I'm hoping it's
>>>>>> not too much off topic.
>>>>>> I have 2 arrays of 2d point coordinates and would like to calculate
>>>>>> all pairwise distances as fast as possible.
>>>>>> Going from Python/Numpy to a (Swigged) C extension already gave me a
>>>>>> 55x speedup.
>>>>>> (0.9 ms vs. 50 ms for arrays of length 329 and 340).
>>>>>> I'm using gcc on Linux.
>>>>>> Now I'm wondering if I could go even faster!?
>>>>>> My hope that the compiler might automagically do some SSE2
>>>>>> optimization was disappointed.
>>>>>> Since I have a 4-core CPU I thought OpenMP might be an option;
>>>>>> I had never used it, and after some playing around I managed to get
>>>>>> (only) a 50% slowdown(!) :-(
>>>>>>
>>>>>> My code in short is this:
>>>>>> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>>>>>> -------<Ccode> ----------
>>>>>> void dists2d(
>>>>>>         double *a_ps, int nx1, int na,
>>>>>>         double *b_ps, int nx2, int nb,
>>>>>>         double *dist, int nx3, int ny3) throw (char*)
>>>>>> {
>>>>>>     if(nx1 != 2) throw (char*) "a must be of shape (n,2)";
>>>>>>     if(nx2 != 2) throw (char*) "b must be of shape (n,2)";
>>>>>>     if(nx3 != nb || ny3 != na) throw (char*) "c must be of shape (na,nb)";
>>>>>>
>>>>>>     double *dist_;
>>>>>>     int i, j;
>>>>>>
>>>>>> #pragma omp parallel private(dist_, j, i)
>>>>>>     {
>>>>>> #pragma omp for nowait
>>>>>>         for(i=0; i<na; i++)
>>>>>>         {
>>>>>>             //num_threads=omp_get_num_threads();  --> 4
>>>>>>             dist_ = dist + i*nb;   // dist_ is only introduced for OpenMP
>>>>>>             for(j=0; j<nb; j++)
>>>>>>             {
>>>>>>                 *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
>>>>>>                                  sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>> -------</Ccode> ----------
>>>>>> There is probably a simple mistake in this code - as I said, I have
>>>>>> never used OpenMP before.
>>>>>> It should not be too difficult to use OpenMP correctly here,
>>>>>> or - maybe better -
>>>>>> is there a simple SSE(2,3,4) version that might be even better than
>>>>>> OpenMP...!?
>>>>>>
>>>>>> I suppose that I did not get the #pragma omp lines right - any idea?
>>>>>> Or is it in general not possible to speed this kind of code up using
>>>>>> OpenMP!?
>>>>>>
>>>>>> Thanks,
>>>>>> Sebastian Haase
>>
>> As an aside, do you have a GPU-equipped machine? This function looks
>> like it would be an easy CUDA target. Whether you want to go that route
>> depends on who the users of the function are (e.g. do they also have the
>> CUDA runtime).
>
These days it's pretty low-hanging fruit -- but you need to learn a bit about
the CUDA API and programming model through the user guide:

http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

I would say it's time well invested. There is PyCUDA, which makes the
"boilerplate" parts of doing GPU computing a bit more pleasant (you just
write the kernel) -- I've been using Cython recently to interact with
CUDA-based C code, which is not a terrible option either (more work than
PyCUDA).

But if speed is your concern, then the GPU will be *way* more bang for your
buck than OpenMP/SSE2. I'd be surprised if you couldn't get at least a 20x
speedup over single-threaded C code even without tweaking (and maybe a lot
faster than that). Unfortunately, with CUDA programming you do spend a lot
of your time tweaking memory transactions to get the best performance.
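To make the "just write the kernel" part concrete, below is a rough, untested
CUDA C sketch of the pairwise 2-D distance computation -- one thread per
output element, using the same row-major (na, nb) output layout as the
dists2d function quoted above. The names (pairwise_dists_kernel, dists2d_cuda),
the 16x16 block shape, and the absence of error checking are illustrative
choices only, not taken from any existing code.

-------<CUDAcode> ----------
#include <cuda_runtime.h>
#include <math.h>

// One thread per (i, j) output element: dist[i*nb + j] = |a_i - b_j|.
// a_ps holds na points as (x, y) pairs; b_ps holds nb points likewise.
__global__ void pairwise_dists_kernel(const double *a_ps, int na,
                                      const double *b_ps, int nb,
                                      double *dist)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // index into b
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // index into a
    if (i < na && j < nb) {
        double dx = a_ps[2*i]     - b_ps[2*j];
        double dy = a_ps[2*i + 1] - b_ps[2*j + 1];
        dist[i*nb + j] = sqrt(dx*dx + dy*dy);
    }
}

// Host-side wrapper: copy the inputs over, launch the kernel, copy the
// result back. Error checking omitted for brevity.
void dists2d_cuda(const double *a_ps, int na,
                  const double *b_ps, int nb,
                  double *dist)
{
    double *d_a, *d_b, *d_dist;
    cudaMalloc((void**)&d_a,    2 * na * sizeof(double));
    cudaMalloc((void**)&d_b,    2 * nb * sizeof(double));
    cudaMalloc((void**)&d_dist, (size_t)na * nb * sizeof(double));

    cudaMemcpy(d_a, a_ps, 2 * na * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b_ps, 2 * nb * sizeof(double), cudaMemcpyHostToDevice);

    dim3 block(16, 16);                        // 256 threads per block
    dim3 grid((nb + block.x - 1) / block.x,    // enough blocks to cover
              (na + block.y - 1) / block.y);   // the (na, nb) output
    pairwise_dists_kernel<<<grid, block>>>(d_a, na, d_b, nb, d_dist);

    cudaMemcpy(dist, d_dist, (size_t)na * nb * sizeof(double),
               cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_dist);
}
-------</CUDAcode> ----------

Two caveats: for arrays of only a few hundred points the host/device copies
will probably dominate, so this really pays off for bigger problems or when
the arrays can stay on the GPU between calls (PyCUDA's gpuarray makes that
easy); and double precision needs a compute capability 1.3+ card -- single
precision may well be enough for a distance matrix.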