Hamish:
> > I am looking to add OpenMP support to v.surf.bspline ...

Sören wrote:
> Try r49406.
> You need to separate the computation for j == 0 and j > 0.
nice, thanks.

> > b) 3.5x speedup is very nice, but any way to improve
> > on that 40% efficiency loss?

> The speedup is better the larger the band matrix is. But the limiting
> factor of the parallel processing speedup is the first computation for
> j == 0. This operation must be done before the rest can be processed
> in parallel. The next limiting factor is the time which the OS needs
> to create a thread, unless the OpenMP implementation uses a thread
> pool ... .

I guess that getting a thread pool typically involves waiting until the
next OS upgrade, or longer?

> > c) for the many raster modules that do
> >
> >     for (row = 0; row < nrows; row++) {
> >         for (col = 0; col < ncols; col++) {
> >
> > I'm guessing it is better to multithread the columns for loop
> > (filling a row's array in parallel), but to keep writing the raster
> > rows serially.

> Yes, much better indeed.

... but parallelizing the row loop would be much better if the whole
array was in memory, or mmap()'d?

> > But does the cost of creating and destroying 2-16 threads per row
> > end up costing too much in terms of create/destroy overhead?
> > IIUC "schedule (static)" helps to minimize the cost of creating
> > each thread a little? (??)

> This is only useful in case the processing of the columns
> needs much more time than the thread creation by the OS, unless a
> thread pool .... .

Under the current method of parallelizing the inside loop though, say for
a quad-core CPU with a 1200 cols x 800 rows array, we get 4 threads per
row, each handling 300 columns, and for the task we have created and
destroyed 4*800 = 3200 threads on a system which will only handle 4 at a
time. Much better (but perhaps harder) would be to parallelize as close to
the main process level as we can, and then only deal with the overhead of
creating/destroying e.g. 4 threads, not 3200. On the other hand, for
OpenCL (I'll work on support for that after the OpenMP stuff has been
committed) a modern GPU may well have 500 cores.

In the case of v.surf.bspline I note it runs using 4-16 subregions for the
test runs I did. If those could each be sent to their own thread I think
we'd be done (for a few years), without the 40% efficiency loss. If so, is
it then possible to call omp_set_num_threads(1); to tell the gmath lib not
to try and parallelize it any more? The fn descr says "number of threads
used by default in subsequent parallel sections", so maybe so. (Rough
sketches of both patterns are further below.)

> Multithreading, especially in case of OpenMP reduction, is only
> meaningful in case the data is large enough, otherwise the serial
> gathering of n and the thread creation takes much longer than the
> computation, unless a thread pool ..... .

And even more so for OpenCL, as copying the data into and the result back
out of video card memory is very, very slow.

> > f) we talked a little about this before, but it would
> > be good to settle on a uniform name for test suite scripts ...

Also it would be good to confirm a standard dataset to use. Generating
fake data on the fly is self-bootstrapping, but requires passing fixed
seeds to v.random etc. Otherwise N.C. 2008 probably gives a wider range of
possibilities than the much smaller Spearfish (mainly because Spearfish
doesn't include lidar data). Any thoughts?

> > g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20%
> > of its time in __iee754_log(). G_ludcmp() also looks like very low
> > hanging fruit.

(great!) It also looks like a very similar clone of other code in the
gmath lib, and I'd expect of BLAS/LAPACK/ATLAS too.
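To make c) a bit more concrete, here is a rough sketch of the inner-loop
pattern (not lifted from any actual module; the raster I/O is reduced to
comments and the per-cell work is just a stand-in):

#include <math.h>
#include <stdlib.h>

/* rows stay serial (raster I/O); only the per-row column loop is parallel */
void process_map(int nrows, int ncols)
{
    double *buf = calloc(ncols, sizeof(double));
    int row, col;

    for (row = 0; row < nrows; row++) {
        /* serial: read raster row 'row' into buf */

#pragma omp parallel for schedule(static)
        for (col = 0; col < ncols; col++)
            buf[col] = sqrt(fabs(buf[col]));   /* stand-in per-cell work */

        /* serial: write buf back out as raster row 'row' */
    }

    free(buf);
}

Note the parallel region is entered once per row, which is where the
4*800 region entries in the example above come from; whether that really
means 3200 thread creations depends on whether the runtime keeps a pool,
as Sören says.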
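And a rough sketch of the coarser-grained idea for the bspline subregions
(subregion_interpolate() is just a hypothetical stand-in for the
per-segment work). Instead of omp_set_num_threads(1) this relies on
omp_set_nested(0); nested parallelism is off by default anyway, so any
parallel sections inside the gmath calls should then run on a single
thread:

#include <omp.h>

void subregion_interpolate(int i);   /* hypothetical per-subregion solver */

void interpolate_all(int nsubregions)
{
    int i;

    /* off by default, but make it explicit: no nested parallel regions,
     * so OpenMP pragmas inside the library calls stay single-threaded */
    omp_set_nested(0);

#pragma omp parallel for schedule(dynamic)
    for (i = 0; i < nsubregions; i++)
        subregion_interpolate(i);
}

That way the e.g. 4 threads are created once for the whole run rather
than once per row or per column pass.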
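Re g), the sort of place a pragma naturally fits in an LU decomposition is
sketched below. This is not the gmath code (and pivoting is left out to
keep it short); it just shows that the row updates for a given column are
independent of each other:

/* sketch only: in-place LU (Doolittle, no pivoting) on an n x n double** */
void lu_sketch(double **a, int n)
{
    int i, j, k;
    double sum;

    for (j = 0; j < n; j++) {
        /* serial: upper part of column j (rows 0 .. j) */
        for (i = 0; i <= j; i++) {
            sum = a[i][j];
            for (k = 0; k < i; k++)
                sum -= a[i][k] * a[k][j];
            a[i][j] = sum;
        }

        /* parallel: rows below the diagonal only read finished values
         * and each writes its own a[i][j], so they are independent */
#pragma omp parallel for private(k, sum) schedule(static)
        for (i = j + 1; i < n; i++) {
            sum = a[i][j];
            for (k = 0; k < j; k++)
                sum -= a[i][k] * a[k][j];
            a[i][j] = sum / a[j][j];
        }
    }
}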
I was able to get v.surf.rst to run faster by putting some pragmas into
G_ludcmp(), but again I wonder if it would be more efficient to
concentrate on parallelizing the module's quadtree segments instead of the
inner loops of the linear algebra. And again, maybe a two-step approach:
do the libs now (relatively easy), then later do the segments and have
that module code also switch off threading for its library calls with
omp_set_num_threads().

> > h) is it possible &/or desirable to use (aka outsource) pre-optimized
> > & pre-threaded BLAS or LAPACK libraries for our linear algebra needs?

> The GRASS ATLAS wrapper is an example for such an approach. ATLAS can
> be used, but in case it is not installed, the default GRASS
> implementation is used.

Oh, I did not know that was there. We can work on adding it to trunk's
./configure next.


Hamish

ps- I didn't add --with-openmp-includes= to ./configure in my patch, just
--with-openmp-libs=. To use the omp_*() fns I guess omp.h is wanted, and
so I should do that after all? (the usual guard is sketched below)
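For reference, the usual guard so code calling the omp_*() fns still
builds with a compiler lacking OpenMP support (generic sketch, not from
any GRASS module; the real code would use the GRASS messaging calls):

#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
    printf("built without OpenMP, running single-threaded\n");
#endif
    return 0;
}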