[Help-gsl] GSL performance compared to C++ alternatives such as Armadillo
Hello all,

First off, this is not a flame post. I would like to use GSL as the linear algebra library for neural network training, which would involve calculating Hessian matrices, normal matrix and vector operations, and so on. I am curious how GSL compares to C++ alternatives such as Armadillo: what are the pros and cons of using GSL as opposed to the alternatives? Additionally, if someone could point me towards some benchmarks, that would be great.

Thanks,
Chris
Re: [Help-gsl] GSL performance compared to C++ alternatives such as Armadillo
> Armadillo would be faster.
> GSL would be more reliable and more widely used.

Eigen may be another good alternative for C++.

- Rhys
Re: [Help-gsl] gsl performance
If there is a chance of a patch getting accepted, I will certainly submit one if I have it. I took a look at the code in the multimin directory last night (the allocation problem affects other parts of the library too, but my example was about the multidimensional minimizer, so I figured I would start there).

I might be wrong, but the way I see it, gsl has two design characteristics that by themselves do not cause much harm but, combined, make it very hard to use stack arrays internally.

One, which I like, is that the routines give the user low-level control over their progress, in the sense that you can create an object, manually iterate, and observe its progress. I prefer this over having a single function that does the work until convergence (but see below).

The other thing that seems to be common in the library is that it never lets the user handle the allocations. For example, if I wanted to do this:

    gsl_multimin_fminimizer s;
    gsl_multimin_fminimizer_type T;
    gsl_multimin_fminimizer_type_init(T);
    gsl_multimin_fminimizer_init(s, T, n);

I simply could not, because the library does not provide such init functions. The above has the advantage of allowing the caller to keep objects on the stack. If one wants arrays on the stack, this is necessary because:

1) Since the work is done across multiple function calls, the internal arrays need to live until the minimizer object dies, so the arrays need to be stored inside the object.

2) If the internal arrays live inside the object, they are allocated on the stack only if the object itself is allocated on the stack, so one really needs the option to allocate the object on the stack.

The way I see it, there are two options to get around this, neither of which breaks the API; both merely extend it:

1) Provide functions that do all the work until convergence (or a user-specified maximum number of iterations) without creating an object. This gives the user less control, but they could always use the standard methods to control the iterations. This is what I do in my library, and by itself it provides a (small) performance improvement because you can skip the creation of objects. Most importantly for the present discussion, it allows the internal arrays to be automatic variables inside such a function.

2) Provide additional init and finalize functions (or whatever we want to call them) to let the user handle the allocations herself, if she wants to.

I am not sure which option I prefer.

Gilberto

On Sun, Oct 6, 2013 at 9:40 PM, Rhys Ulerich rhys.uler...@gmail.com wrote:
>> Rhys, I did try to use views. They do not help because the gsl routines
>> allocate vectors internally and there is not much that I can do about
>> it... except for maybe hacking gsl and changing gsl_vector_alloc myself.
>
> If from hacking around you happen to restructure the API so that a clean
> workspace can be allocated for a given problem size and then passed in to
> avoid the problematic internal allocations, please pass along a patch.
> There's a lot of precedent in the library for having low-level compute
> kernels wrapped by convenience methods, and refactoring a kernel to match
> that pattern would be most welcome provided the existing APIs remain
> unbroken.
>
> - Rhys
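To make option 2 concrete, here is a hypothetical sketch of caller-managed storage; none of these names exist in GSL, and the fixed capacity is purely illustrative:

    /* Hypothetical sketch (NOT part of GSL): a minimizer state with a
       fixed capacity, so the whole object, internal work arrays
       included, can live on the stack. */
    #include <stddef.h>

    enum { NMAX = 16 };                     /* illustrative capacity */

    typedef struct {
        double simplex[(NMAX + 1) * NMAX];  /* n+1 vertices, n coords each */
        double fval[NMAX + 1];              /* objective value per vertex  */
        size_t n;                           /* actual dimension, n <= NMAX */
    } fminimizer_stack;                     /* hypothetical type */

    /* hypothetical init: prepares the state but performs no allocation */
    static int fminimizer_stack_init(fminimizer_stack *s, size_t n)
    {
        if (n > NMAX) return -1;            /* caller must respect capacity */
        s->n = n;
        return 0;
    }

    int main(void)
    {
        fminimizer_stack s;                 /* automatic storage, no malloc */
        return fminimizer_stack_init(&s, 5);
    }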
Re: [Help-gsl] gsl performance
Hi,

On 7 October 2013 18:22, onefire onefire.mys...@gmail.com wrote:
> One, which I like, is that the routines give the user low-level control
> over their progress, in the sense that you can create an object, manually
> iterate, and observe its progress. I prefer this over having a single
> function that does the work until convergence (but see below).

Yes, everybody wants to terminate at different places: flat gradient, maximum number of steps, some combination, or some other condition... Having this outside of the GSL makes sense. You could let this be evaluated by another callback, but then that would be three callbacks now?

Your earlier messages imply that you want to perform many minimizations (or other algorithms?). Could you not just allocate one minimizer (or one per thread) and reset it as needed? That way you don't need to keep free()ing and malloc()ing the same memory all the time. I guess it depends on whether the number of variables changes.

I tried this technique with the ODE solvers in the GSL and it only gave me about a 5% overall performance improvement, so I dropped it from my code: it was a fiddle to maintain and the user would barely notice the difference. I was doing quite a lot of other things, though, so if your overall time is dominated by malloc/free it may help.

Not sure whether it would be worth trying a different memory allocator. There used to be faster ones available, but they could well have been folded into the standard libraries by now.

Hope some of that is useful for you!

Sam
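A minimal sketch of this reuse pattern with GSL's nmsimplex2 minimizer (the quadratic objective and the run count are stand-ins, not from the thread): the minimizer and the vectors are allocated once, and gsl_multimin_fminimizer_set merely resets them for each new problem.

    #include <stdio.h>
    #include <gsl/gsl_multimin.h>

    /* stand-in objective: f(x) = sum of squares */
    static double sphere(const gsl_vector *x, void *params)
    {
        (void)params;
        double s = 0.0;
        for (size_t i = 0; i < x->size; ++i) {
            double xi = gsl_vector_get(x, i);
            s += xi * xi;
        }
        return s;
    }

    int main(void)
    {
        const size_t n = 5;
        gsl_multimin_function f = { &sphere, n, NULL };

        /* one-time allocations, reused across all minimizations */
        gsl_multimin_fminimizer *s =
            gsl_multimin_fminimizer_alloc(gsl_multimin_fminimizer_nmsimplex2, n);
        gsl_vector *x0   = gsl_vector_alloc(n);
        gsl_vector *step = gsl_vector_alloc(n);
        gsl_vector_set_all(step, 0.1);

        for (int run = 0; run < 1000; ++run) {            /* many problems */
            gsl_vector_set_all(x0, 1.0 + run * 1e-3);     /* new start point */
            gsl_multimin_fminimizer_set(s, &f, x0, step); /* reset, no fresh alloc */

            int status;
            do {
                status = gsl_multimin_fminimizer_iterate(s);
                if (status) break;
                status = gsl_multimin_test_size(
                    gsl_multimin_fminimizer_size(s), 1e-8);
            } while (status == GSL_CONTINUE);
        }

        printf("last minimum found: %g\n", s->fval);
        gsl_multimin_fminimizer_free(s);
        gsl_vector_free(x0);
        gsl_vector_free(step);
        return 0;
    }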
Re: [Help-gsl] gsl performance
On Mon, Oct 7, 2013 at 3:39 PM, Sam Mason sammasonwarwick...@gmail.com wrote:
> I tried this technique with the ODE solvers in the GSL and it only gave
> me about a 5% overall performance improvement, so I dropped it from my
> code: it was a fiddle to maintain and the user would barely notice the
> difference. I was doing quite a lot of other things, though, so if your
> overall time is dominated by malloc/free it may help.

I am not surprised by your results because, contrary to what my previous messages might suggest, I think the main problem is not the allocations themselves but where the memory lives: at least for certain problems, the machine is just much more efficient at accessing stack memory.

Unfortunately, it seems that it is not trivial to implement the init functions that I suggested previously, because the minimizer has to accept different types of objects depending on the problem. The library currently uses void pointers, but that does not work if you need the objects to be known at compile time. Here the lack of overloading in C really kills it: one would need many more names for types and functions, which could have a deep impact on the API.

G
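The naming problem can be seen in a tiny hypothetical sketch (none of these types or functions are GSL API): without overloading, every compile-time-typed state drags in a parallel set of function names, much as GSL already multiplies gsl_vector into gsl_vector_float, gsl_vector_complex, and so on.

    /* Hypothetical illustration: one init function per concrete state
       type, because C cannot overload on the argument type. */
    typedef struct { double simplex[6 * 5]; } nmsimplex_state; /* one algorithm */
    typedef struct { double grad[5]; }       bfgs_state;       /* another */

    static int init_nmsimplex(nmsimplex_state *s) { (void)s; return 0; }
    static int init_bfgs(bfgs_state *s)           { (void)s; return 0; }

    int main(void)
    {
        nmsimplex_state a;  /* both states can live on the stack, but    */
        bfgs_state      b;  /* every new type needs its own set of names */
        return init_nmsimplex(&a) + init_bfgs(&b);
    }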
[Help-gsl] gsl performance
Hi all,

I am creating a personal library for my C++ projects and, to evaluate its performance, I decided to compare it to gsl. To my surprise, my library is much faster than gsl in many cases. For example, my Nelder-Mead implementation can be 5 or 10 times faster for some functions. This has been bothering me a lot because:

1) My tests are correct, and I checked the results against output from Scipy and Octave.
2) I am certainly not smarter than the many people who worked on gsl over the years.
3) I think I know how to use gsl properly.

Sorry I do not have an example, but my library is not polished enough for me to publish it yet.

What I am really intrigued about is that I googled around and it seems that many people noticed (before me) that many gsl implementations are inefficient. I also found some posts on the mailing lists with things like "gsl is much slower than Matlab" and responses like "gsl is not meant to be the fastest". These things lead me to believe that gsl is slow by design. If this is the case, can someone explain to me why?

Gilberto

PS: Much of the gain with C++ comes from better inlining, which can be especially important with things like function optimization (it is hard, if not impossible, to inline when there is a function pointer passed around). This is why std::sort can be much faster than qsort. However, I am confident that it is possible to write faster implementations (in C) than some of the ones in gsl.
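The inlining point in the PS is easy to demonstrate: qsort calls its comparator through a function pointer on every comparison, while std::sort receives the comparator's type as a template argument, so the optimizer can inline it. A self-contained sketch (the data and sizes are illustrative):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    /* comparator reached through a function pointer: hard to inline */
    static int cmp_double(const void *a, const void *b)
    {
        double x = *static_cast<const double *>(a);
        double y = *static_cast<const double *>(b);
        return (x > y) - (x < y);
    }

    int main()
    {
        std::vector<double> v(1000000);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = static_cast<double>((i * 2654435761u) % 1000003);
        std::vector<double> w = v;

        /* C-style: every comparison is an indirect call */
        std::qsort(v.data(), v.size(), sizeof(double), cmp_double);

        /* C++-style: the lambda's type is part of std::sort's
           instantiation, so the comparison can be inlined */
        std::sort(w.begin(), w.end(),
                  [](double x, double y) { return x < y; });

        return v == w ? 0 : 1;
    }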
Re: [Help-gsl] gsl performance
Hi Gilberto,

congrats on your performance results! A first guess at the cause of the gsl library's worse performance would be its error handling. The GSL is made for 'users', and this means that a user has to be informed about the kind of error he encounters. This can be left out if you have your own code and you know your own code: if something goes wrong, you know where to look in your code for where the error occurred and what the reasons could be. A 'user' just using a library could be helpless when he encounters an error and does not know where it occurred. So checking and error catching could be a reason that makes the gsl less performant but more appropriate for 'easy' usage.

I use the C++ linear algebra library Armadillo a lot. This library has, for example, two different element accessors: '()', which performs bounds checking, and 'at()', which performs none. Performance sometimes increases tremendously when using the 'at()' method.

Best

Simon

On Oct 6, 2013, at 12:44 AM, onefire onefire.mys...@gmail.com wrote:
> Hi all, I am creating a personal library for my C++ projects and, to
> evaluate its performance, I decided to compare it to gsl. To my surprise,
> my library is much faster than gsl in many cases. For example, my
> Nelder-Mead implementation can be 5 or 10 times faster for some
> functions. [...]
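A minimal sketch of the two Armadillo accessors Simon describes (the matrix size is arbitrary): operator() validates the indices on every access, while at() trusts the caller.

    #include <armadillo>

    int main()
    {
        arma::mat A(100, 100, arma::fill::randu);
        double s1 = 0.0, s2 = 0.0;

        /* operator() checks that (i, j) is in range on every access
           (unless the code is built with ARMA_NO_DEBUG) */
        for (arma::uword i = 0; i < A.n_rows; ++i)
            for (arma::uword j = 0; j < A.n_cols; ++j)
                s1 += A(i, j);

        /* at() skips the bounds check; the caller guarantees validity */
        for (arma::uword i = 0; i < A.n_rows; ++i)
            for (arma::uword j = 0; j < A.n_cols; ++j)
                s2 += A.at(i, j);

        return s1 == s2 ? 0 : 1;
    }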
Re: [Help-gsl] gsl performance
I am aware of those. I tried the following:

1) Call gsl_set_error_handler_off() to turn off the error handler.
2) Compile with -DHAVE_INLINE and -DGSL_RANGE_CHECK_OFF.
3) Link to a different cblas (actually I tried openblas, mkl and gslcblas).

However, my most interesting experiment was to modify my minimization algorithms to accept dynamic arrays. Originally, my function was using std::array to store the values of the independent variables when minimizing a multidimensional function. My experiment mimicked my use case: solve the same problem millions of times (to be precise, a million times). When I used dynamic arrays, my algorithm became much slower because of all the calls to malloc/new and free/delete. So I modified my algorithm to accept a gsl_vector. Guess what happened? My library became slower than gsl! I am not sure if this is because my Nelder-Mead implementation is different from gsl's (which I did not look at) or if it is just the extra overhead of making the wrapper.

My point is: the huge differences that I observed were because of gsl_vector, which is not efficient for small arrays (stack memory is faster and you avoid the costs of malloc/free). So I think that the gsl minimization algorithms could have a version that uses stack arrays. I should not have to pay the cost of dynamic allocation if I am minimizing a function of 5 variables (a typical Nelder-Mead implementation needs 6 simplex points of 5 coordinates each, plus the 6 function values, i.e. an array of 36 elements, which is fine for the stack). gsl could have alternative implementations that use stack memory (with warnings not to use them if the problem is big).

The other issue is that the implementation of gsl_vector just seems inefficient to me. Looking at the code, it seems like a single vector requires 3 calls each to malloc and free (one pair for the data, one for the gsl_block, and one for the gsl_vector struct itself). The manual states that the block is there for consistency, and I can see how memory management becomes easier with it, but it seems to be a case of generality at the expense of performance. Also, the stride and the owner flag are part of the gsl_vector object to make it work with views, so people who never need views pay the performance price anyway.

All of these issues are unlikely to matter for big vectors, but they make a large difference when you are dealing with small objects. Am I the only one who thinks like this?

Gilberto

PS: As a side note, I prefer Eigen over Armadillo. In my experience, the former is almost always faster.

On Sun, Oct 6, 2013 at 5:09 AM, Simon Zehnder simon.zehn...@googlemail.com wrote:
> Hi Gilberto, congrats on your performance results! A first guess at the
> cause of the gsl library's worse performance would be its error handling.
> [...]
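For vectors the caller owns, the triple allocation can be sidestepped with a view over a stack buffer; note this does nothing about vectors that the gsl routines allocate internally, which is the sticking point elsewhere in this thread. A minimal sketch:

    #include <stdio.h>
    #include <gsl/gsl_vector.h>

    int main(void)
    {
        /* gsl_vector_alloc(5) mallocs three times: the data block,
           the gsl_block header, and the gsl_vector struct itself */
        gsl_vector *heap_v = gsl_vector_alloc(5);

        /* a view over automatic storage allocates nothing: the
           gsl_vector lives inside the view, the data lives in buf */
        double buf[5] = { 1, 2, 3, 4, 5 };
        gsl_vector_view view = gsl_vector_view_array(buf, 5);
        gsl_vector *stack_v = &view.vector;

        gsl_vector_memcpy(heap_v, stack_v);   /* usable like any vector */
        printf("%g\n", gsl_vector_get(heap_v, 2));

        gsl_vector_free(heap_v);   /* only the heap vector needs freeing */
        return 0;
    }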
Re: [Help-gsl] gsl performance
Rhys, I did try to use views. They do not help because the gsl routines allocate vectors internally and there is not much that I can do about it... except for maybe hacking gsl and changing gsl_vector_alloc myself.

> The main target of a library like GSL is an inherent consistency of its
> objects, and as it defines the gsl_vector and uses it everywhere in its
> methods, consistency is reached.

That's what I thought too. That would be more consistent with my initial question about gsl design not being about performance. I had heard about nlopt before and I plan to benchmark my results against it as well. I started with gsl because I am more familiar with it.

On Sun, Oct 6, 2013 at 4:38 PM, Simon Zehnder simon.zehn...@googlemail.com wrote:
> I write between the lines.
>
> On Oct 6, 2013, at 10:04 PM, onefire onefire.mys...@gmail.com wrote:
>> I am aware of those. I tried the following: [...] So I modified my
>> algorithm to accept a gsl_vector. Guess what happened? My library became
>> slower than gsl! I am not sure if this is because my Nelder-Mead
>> implementation is different from gsl's (which I did not look at) or if
>> it is just the extra overhead of making the wrapper.
>
> Here a profiling could give clearer insights - if it is possible - about
> the gsl functions that take the most time.
>
>> My point is: the huge differences that I observed were because of
>> gsl_vector, which is not efficient for small arrays (stack memory is
>> faster and you avoid the costs of malloc/free). [...] I think that gsl
>> could have alternative implementations that use stack memory (with
>> warnings not to use them if the problem is big).
>
> Interesting! Indeed, in most statistical problems I am involved in,
> around 6 parameters is also a good average; physicists and engineers may
> have larger parameter vectors. The main target of a library like GSL is
> an inherent consistency of its objects, and as it defines the gsl_vector
> and uses it everywhere in its methods, consistency is reached.
> Performance is not its main target, I would guess. Providing a function
> that works with STL objects - commonly used by C++ developers - is not a
> bad idea, though, and, as you report, it improves performance. Especially
> embedding GSL into self-developed code becomes easier with such a
> solution, as the step from one's own code to GSL methods would be more
> fluent. For another performance benchmark you could try the Nelder-Mead
> from nlopt. This works with std::vectors.
>
>> The other issue is that the implementation of gsl_vector just seems
>> inefficient to me. [...] Also, the stride and the owner flag are part of
>> the gsl_vector object to make it work with views, so people who never
>> need views pay the performance price anyway.
>
> Again, I would assume the main development target was inherent
> consistency.
>
>> All of these issues are unlikely to matter for big vectors, but they
>> make a large difference when you are dealing with small objects. Am I
>> the only one who thinks like this?
>>
>> Gilberto
>>
>> PS: As a side note, I prefer Eigen over Armadillo. In my experience, the
>> former is almost always faster.
>
> Thanks for the suggestion, I heard about it but never tried it. I will
> give it a shot.
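For reference, nlopt's C++ interface to Nelder-Mead, which Simon mentions, does work directly with std::vector; a minimal sketch with a stand-in objective (the tolerance and dimension are arbitrary):

    #include <cstdio>
    #include <nlopt.hpp>
    #include <vector>

    /* objective for nlopt's C++ API; grad stays unused because
       Nelder-Mead is derivative-free */
    static double sphere(const std::vector<double> &x,
                         std::vector<double> &grad, void *data)
    {
        (void)grad; (void)data;
        double s = 0.0;
        for (double xi : x) s += xi * xi;
        return s;
    }

    int main()
    {
        nlopt::opt opt(nlopt::LN_NELDERMEAD, 5);  /* 5 variables */
        opt.set_min_objective(sphere, nullptr);
        opt.set_xtol_rel(1e-8);

        std::vector<double> x(5, 1.0);            /* starting point */
        double minf = 0.0;
        opt.optimize(x, minf);                    /* x, minf hold the result */

        std::printf("minimum: %g\n", minf);
        return 0;
    }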
Re: [Help-gsl] gsl performance
> Rhys, I did try to use views. They do not help because the gsl routines
> allocate vectors internally and there is not much that I can do about
> it... except for maybe hacking gsl and changing gsl_vector_alloc myself.

If from hacking around you happen to restructure the API so that a clean workspace can be allocated for a given problem size and then passed in to avoid the problematic internal allocations, please pass along a patch. There's a lot of precedent in the library for having low-level compute kernels wrapped by convenience methods, and refactoring a kernel to match that pattern would be most welcome provided the existing APIs remain unbroken.

- Rhys
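The precedent Rhys mentions can be seen in, for example, the integration routines, where the caller sizes and owns a workspace and the kernel allocates nothing per call. A minimal sketch (the integrand and tolerances are stand-ins):

    #include <math.h>
    #include <stdio.h>
    #include <gsl/gsl_integration.h>

    static double integrand(double x, void *params)
    {
        (void)params;
        return exp(-x * x);
    }

    int main(void)
    {
        /* the caller allocates the workspace once, sized for the problem */
        gsl_integration_workspace *w = gsl_integration_workspace_alloc(1000);

        gsl_function F = { &integrand, NULL };
        double result, abserr;

        /* the kernel reuses the same workspace on every call */
        for (int i = 0; i < 100; ++i)
            gsl_integration_qags(&F, 0.0, 1.0, 1e-10, 1e-10, 1000, w,
                                 &result, &abserr);

        printf("integral = %.12f +/- %g\n", result, abserr);
        gsl_integration_workspace_free(w);
        return 0;
    }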