[Help-gsl] gsl performance
Hi all,

I am creating a personal library for my C++ projects and, to evaluate its performance, I decided to compare it to gsl. To my surprise, my library is much faster than gsl in many cases. For example, my Nelder-Mead implementation can be 5 or 10 times faster for some functions. This has been bothering me a lot because: 1) my tests are correct, and I checked the results against output from Scipy and Octave; 2) I am certainly not smarter than the many people who worked on gsl over the years; 3) I think I know how to use gsl properly. Sorry I do not have an example, but my library is not polished enough for me to publish it yet.

What really intrigues me is that I googled around and it seems that many people noticed (before me) that many gsl implementations are inefficient. I also found posts on the mailing lists saying things like "gsl is much slower than Matlab", with responses like "gsl is not meant to be the fastest". These things lead me to believe that gsl is slow by design. If this is the case, can someone explain to me why?

Gilberto

PS: Much of the gain with C++ comes from better inlining, which can be especially important for things like function optimization (it is hard, if not impossible, to inline when a function pointer is passed around). This is why std::sort can be much faster than libc's qsort. However, I am confident that it is possible to write faster implementations (in C) than some of the ones in gsl.
Re: [Help-gsl] gsl performance
Hi Gilberto,

congrats on your performance results! A first guess at the worse performance of the gsl library would be its error handling. The GSL is made for 'users', and this means that a user has to be informed about the kind of error he encounters. This can be left out in your own code, because you know your own code: if something goes wrong, you know where to look for where the error occurred and what the reasons could be. A 'user' just using a library could be helpless when he encounters an error and does not know where it occurred. So checking and error catching could be a reason that makes the gsl less performant but more appropriate for 'easy' usage.

I use the C++ linear algebra library Armadillo a lot. That library has, for example, two different element accessors: '()', which performs bounds checks, and 'at()', which does not. Performance sometimes increases tremendously when using the 'at()' method.

Best

Simon

On Oct 6, 2013, at 12:44 AM, onefire onefire.mys...@gmail.com wrote: Hi all, I am creating a personal library for my C++ projects and to evaluate its performance I decided to compare it to gsl [...]
Re: [Help-gsl] gsl performance
I am aware of those. I tried the following: 1) calling gsl_set_error_handler_off() to turn off the error handler; 2) compiling with -DHAVE_INLINE and -DGSL_RANGE_CHECK_OFF; 3) linking to a different cblas (actually I tried openblas, mkl and gslcblas).

However, my most interesting experiment was to modify my minimization algorithms to accept dynamic arrays. Originally, my function was using std::array to store the values of the independent variables when minimizing a multidimensional function. My experiment mimicked my use case: solve the same problem millions of times (to be precise, a million times). When I used dynamic arrays, my algorithm became much slower because of all the calls to malloc/new and free/delete. So I modified my algorithm to accept a gsl_vector. Guess what happened? My library became slower than gsl! I am not sure if this is because my Nelder-Mead implementation is different from gsl's (which I did not look at) or if it is just the extra overhead of the wrapper.

My point is: the huge differences that I observed were because of gsl_vector, which is not efficient for small arrays (stack memory is faster and you avoid the cost of malloc/free). So I think the gsl minimization algorithms could have a version that uses stack arrays. I should not have to pay the cost of dynamic allocation if I am minimizing a function of 5 variables (a typical Nelder-Mead implementation needs a simplex of 6 points; even counting the function evaluations, that is an array of 36 elements, which is fine for the stack). I think gsl could have alternative implementations that use stack memory (with warnings not to use them if the problem is big).

The other issue is: the implementation of gsl_vector just seems inefficient to me. Looking at the code, it seems that a single vector requires 3 calls to malloc and free (one for the data, one for the gsl_block, and one for the gsl_vector itself).
The manual states that the block is there for consistency, and I can see how memory management becomes easier with it. But it seems to be a case of generality at the expense of performance. Also, the stride and the owner flag are part of the gsl_vector object to make it work with views, but people who never need views pay the performance price anyway. None of these issues is likely to matter for big vectors, but they make a large difference when you are dealing with small objects. Am I the only one who thinks like this?

Gilberto

PS: As a side note, I prefer Eigen over Armadillo. In my experience, the former is almost always faster.

On Sun, Oct 6, 2013 at 5:09 AM, Simon Zehnder simon.zehn...@googlemail.com wrote: Hi Gilberto, congrats on your performance results! [...]
Re: [Help-gsl] gsl performance
Rhys, I did try to use views. They do not help because the gsl routines allocate vectors internally and there is not much that I can do about it... except for maybe hacking gsl and changing gsl_vector_alloc myself.

"The main target of a library like GSL is an inherent consistency of its objects, and as it defines the gsl_vector and uses it everywhere in its methods, consistency is reached." That is what I thought too. It would be consistent with my initial question about gsl's design not being about performance.

I heard about nlopt before and I plan to benchmark my results against it as well. I started with gsl because I am more familiar with it.

Gilberto

On Sun, Oct 6, 2013 at 4:38 PM, Simon Zehnder simon.zehn...@googlemail.com wrote:

I write between the lines.

On Oct 6, 2013, at 10:04 PM, onefire onefire.mys...@gmail.com wrote: [...] I am not sure if this is because my Nelder-Mead implementation is different from gsl's (which I did not look at) or if it is just the extra overhead of the wrapper.

Here a profiling could give clearer insights - if it is possible - about which gsl functions take the most time.

[...] I think that gsl could have alternative implementations that use stack memory (with warnings not to use them if the problem is big).

Interesting! Indeed, in most statistical problems I am involved in, around 6 parameters is a good average; physicists and engineers may have larger parameter vectors. The main target of a library like GSL is an inherent consistency of its objects, and as it defines the gsl_vector and uses it everywhere in its methods, consistency is reached. Performance is not its main target, I would guess. Providing functions that work with STL objects, which are commonly used by C++ developers, is not a bad idea though, and, as you report, it improves performance. Embedding GSL into self-developed code would also become easier with such a solution, as the step from one's own code to GSL methods would be more fluent. For another performance benchmark you could try the Nelder-Mead from nlopt, which works with std::vector.

[...] Also, the stride and the owner flag are part of the gsl_vector object to make it work with views, but people who never need views pay the performance price anyway.

Again, I would assume the main development target was inherent consistency.

[...] PS: As a side note, I prefer Eigen over Armadillo. In my experience, the former is almost always faster.

Thanks for the suggestion; I heard about it but never tried it. I will give it a shot.
[Help-gsl] Possible leak in gsl_multifit_fdfsolver.
Hi. I've been struggling with this problem for a few weeks now, but I do not have enough experience with debugging to be sure about what is written below. I'll quote some simple code, which I believe you know well:

51: s = gsl_multifit_fdfsolver_alloc (T, n, p);
52: gsl_multifit_fdfsolver_set (s, &f, &x.vector);
56: do {
57:   iter++;
58:   status = gsl_multifit_fdfsolver_iterate (s);
64:   if (status)
65:     break;
67:   status = gsl_multifit_test_delta (s->dx, s->x,
68:                                     1e-3, 1e-3);
69: }
70: while (status == GSL_CONTINUE && iter < 500);
72: gsl_multifit_covar (s->J, 0.0, covar);
77: double chi = gsl_blas_dnrm2 (s->f);
89: for (int nN = 0; nN < nNumParameters; nN++) {
90:   padParameters[nN] = gsl_vector_get (s->x, nN);
91: }
93: gsl_multifit_fdfsolver_free (s);
94: gsl_matrix_free (covar);

Where, from gdb:

(gdb) p s
$68 = (gsl_multifit_fdfsolver *) 0x10080f270
(gdb) p *s
$69 = { type = 0x1001b6de0, fdf = 0x7fff5fbfdbc0, x = 0x100810480, f = 0x10080f6c0, J = 0x10080f740, dx = 0x100810a00, state = 0x100810cb0 }
(gdb) whatis s->state
type = void *

Now, if I step through this code, keeping an eye on address 0x100810cc0:

51  s = gsl_multifit_fdfsolver_alloc (T, n, p);
(gdb) x 0x100810cc0
0x100810cc0: 0x00000000
(gdb) s
52  gsl_multifit_fdfsolver_set (s, &f, &x.vector);
(gdb) x 0x100810cc0
0x100810cc0: 0x00000000
(gdb) s
57  iter++;
(gdb) x 0x100810cc0
0x100810cc0: 0x052e3967
(gdb) s
58  status = gsl_multifit_fdfsolver_iterate (s);
(gdb) x 0x100810cc0
0x100810cc0: 0x052e3967
(gdb) s
64  if (status)
(gdb) x 0x100810cc0
0x100810cc0: 0xa697930c
(gdb) s
67  status = gsl_multifit_test_delta (s->dx, s->x,
(gdb) x 0x100810cc0
0x100810cc0: 0xa697930c
(gdb) s
69  }
(gdb) x 0x100810cc0
0x100810cc0: 0xa697930c
(gdb) s
57  iter++;
(gdb) x 0x100810cc0
0x100810cc0: 0xa697930c
(gdb) s
58  status = gsl_multifit_fdfsolver_iterate (s);
(gdb) x 0x100810cc0
0x100810cc0: 0xa697930c
(gdb) s
64  if (status)
(gdb) x 0x100810cc0
0x100810cc0: 0x2d9931ef
(gdb) s
67  status = gsl_multifit_test_delta (s->dx, s->x,
(gdb) b 92
Breakpoint 11 at 0x1b774: file Fit.cpp, line 92.
(gdb) c
Continuing.
...
Breakpoint 11, Fit::FitData (this=0x100810120, cXData=0x7fff5fbfded8, cYData=0x7fff5fbfdea0, cYStdDevData=0x7fff5fbfde68, dStdDev=0x7fff5fbfdf60, padParameters=0x100810130, nNumParameters=6) at Fit.cpp:93
93  gsl_multifit_fdfsolver_free (s);
(gdb) x 0x100810cc0
0x100810cc0: 0x0fb668f4
(gdb) s
94  gsl_matrix_free (covar);
(gdb) x 0x100810cc0
0x100810cc0: 0x0fb600bb
(gdb) s
96  *dStdDev = chi;
(gdb) x 0x100810cc0
0x100810cc0: 0x0fb600bb

This is a completely reproducible error. The address 0x100810cc0 has a null value in all other areas of the code. It assumes a non-null value only after line 52, and its value changes every time status is accessed. In the end, when the gsl_multifit_fdfsolver should be freed, some trash is left at that address and I get the following intermittent, extremely annoying error after I try to leave my data processing functions:

malloc: *** error for object 0x100810cc0: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug

Program received signal SIGABRT, Aborted.
0x7fff88c1ad46 in __kill ()

It took me a long time to find what was happening at that address, because I wrapped the GSL code in a class which seems to clean up everything after its destructor is called, and, when the malloc error happens, that address has a null value again. I can send the complete code to whoever can help me understand what is going on.

Thanks a lot for your attention,

Eduardo.

Eduardo Novaes Hering, PhD (MSc, DSc).
e.n.her...@icloud.com / eduardo.her...@cea.fr / enher...@gmail.com
INAC/SPSMS/IMAPEC, CEA Grenoble. 17 rue des Martyrs, 38054 Grenoble cedex 9. France.
Re: [Help-gsl] gsl performance
Rhys, I did try to use views. They do not help because the gsl routines allocate vectors internally and there is not much that I can do about it... except for maybe hacking gsl and changing gsl_vector_alloc myself. If from hacking around you happen to restructure the API so that a clean workspace can be allocated for a given problem size and then passed in to avoid the problematic internal allocations, please pass along a patch. There's a lot of precedent in the library for having low-level compute kernels wrapped by convenience methods, and refactoring a kernel to match that pattern would be most welcome provided the existing APIs remain unbroken. - Rhys