[Help-gsl] GSL performance compared to C++ alternatives such as Armadillo

2015-10-26 Thread Christopher Choi
Hello all,

First off, this is not a flame post. I would like to use GSL as a linear
algebra library for neural network training, so this would involve
calculating Hessian matrices, normal matrix and vector operations, etc. I
was curious how GSL compares to C++ alternatives such as Armadillo: what
are the pros and cons of using GSL as opposed to the alternatives?

Additionally, if someone could point me towards some benchmarks, that
would be great.

Thanks
Chris


Re: [Help-gsl] GSL performance compared to C++ alternatives such as Armadillo

2015-10-26 Thread Rhys Ulerich
> Armadillo would be faster.
> GSL would be more reliable and wider used.

Eigen may be another good alternative for C++.

- Rhys


Re: [Help-gsl] gsl performance

2013-10-07 Thread onefire
If there is a chance of a patch getting accepted I will certainly submit it
if I have one.

I took a look at the code in the multimin directory last night (the
problem affects other parts of the library too, but my example was about
the multidimensional minimizer, so I figured I would start there).

I might be wrong, but the way I see it, gsl has two design characteristics
that by themselves do not cause much harm but, if combined, make it very
hard to use stack arrays internally.

One, which I like, is that the routines give the user low-level control
over their progress, in the sense that you can create an object, manually
iterate, and observe the progress. I prefer this over having a single
function that does the work until convergence (but see below).

The other characteristic, which seems to be common throughout the library,
is that it never lets the user handle the allocations. For example, if I
wanted to do this:
gsl_multimin_fminimizer s;        /* minimizer object on the stack */
gsl_multimin_fminimizer_type T;   /* type descriptor on the stack  */

gsl_multimin_fminimizer_type_init(&T);    /* hypothetical, not in GSL */
gsl_multimin_fminimizer_init(&s, &T, n);  /* hypothetical, not in GSL */

I just could not do it because the library does not provide such init
functions.

The above has the advantage that it allows the caller to have objects on
the stack. If one wants arrays that live on the stack, this is necessary
because:
1) Since the work is done across multiple function calls, the internal
arrays need to live until the minimizer object dies, so the arrays need to
be members embedded in the object.
2) If the internal arrays are embedded in the object, they are allocated
on the stack only if the object itself is allocated on the stack, so one
really needs the option to allocate the object on the stack.
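
To illustrate points 1) and 2), here is a minimal sketch of the idea in
C++; the simplex_state type is hypothetical, not GSL API:

#include <array>
#include <cstddef>

/* Hypothetical fixed-capacity simplex state: the arrays are members, so
   they live exactly as long as the object and sit on the stack whenever
   the object itself does. */
template <std::size_t N>
struct simplex_state {
    std::array<double, (N + 1) * N> points; /* N+1 vertices, N coords each */
    std::array<double, N + 1> values;       /* function value per vertex   */
};

int main() {
    simplex_state<5> s;  /* entire workspace lives in this stack frame */
    s.values[0] = 0.0;   /* persists across manual iterate() calls     */
    return 0;
}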

The way I see it, there are two options to get around this, neither of
which breaks the API; both merely extend it:
1) Provide functions that do all the work until convergence (or a
user-specified maximum number of iterations) without creating an object.

This gives the user less control, but they could always fall back on the
standard methods to control the iterations. This is what I do in my
library, and by itself it provides a (small) performance improvement
because you can skip the creation of objects. But most importantly for the
present discussion, it allows the internal arrays to be automatic
variables inside such a function.

2) Provide additional init and finalize functions (or whatever we want to
call them) to let the user handle the allocations herself (if she wants
to). Both options are sketched below.
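
As a rough sketch, the two extensions might look like the declarations
below; the names and signatures are hypothetical and do not exist in GSL:

#include <gsl/gsl_multimin.h>

/* Option 1 (hypothetical): a one-shot driver that iterates to convergence
   internally, keeping its working arrays as automatic (stack) variables
   inside the function. */
int gsl_multimin_fminimize(const gsl_multimin_fminimizer_type *T,
                           gsl_multimin_function *f,
                           const gsl_vector *x0, const gsl_vector *step,
                           double tol, size_t max_iter, gsl_vector *xmin);

/* Option 2 (hypothetical): an init/finalize pair so the caller owns the
   object's storage and may place it on the stack. */
int gsl_multimin_fminimizer_init(gsl_multimin_fminimizer *s,
                                 const gsl_multimin_fminimizer_type *T,
                                 size_t n);
void gsl_multimin_fminimizer_finalize(gsl_multimin_fminimizer *s);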

I am not sure which option I prefer.

Gilberto



Re: [Help-gsl] gsl performance

2013-10-07 Thread Sam Mason
Hi,

On 7 October 2013 18:22, onefire onefire.mys...@gmail.com wrote:
 One, which I like, is that the routines give the user low-level control
 over their progress in the sense that you can create an object, manually
 iterate and observe their progress. I prefer this over having a single
 function that does the work until convergence (but see below).

Yes, everybody wants to terminate at a different point: flat gradient,
maximum number of steps, some combination, or another condition entirely.
Having this outside of the GSL makes sense. You could let the stopping
test be evaluated by another callback, but then you would be up to three
callbacks.

Reading your earlier messages, it sounds like you want to perform many
minimizations (or other algorithms?). Could you not just allocate one
minimizer (or one per thread) and reset it as needed, as in the sketch
below? That way you don't need to keep free()ing and malloc()ing the same
memory all the time. I guess it depends on whether the number of variables
changes.
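
A minimal sketch of that allocate-once, reset-per-problem pattern with the
multimin API; the quadratic objective and the tolerances are just
placeholders:

#include <gsl/gsl_errno.h>
#include <gsl/gsl_multimin.h>
#include <cstdio>

/* Placeholder objective: f(x) = sum_i (x_i - 1)^2. */
static double quad(const gsl_vector *x, void *params) {
    (void)params;
    double sum = 0.0;
    for (size_t i = 0; i < x->size; ++i) {
        const double d = gsl_vector_get(x, i) - 1.0;
        sum += d * d;
    }
    return sum;
}

int main() {
    const size_t n = 5;
    /* Allocate the minimizer and its workspace once... */
    gsl_multimin_fminimizer *s =
        gsl_multimin_fminimizer_alloc(gsl_multimin_fminimizer_nmsimplex2, n);
    gsl_vector *x0 = gsl_vector_alloc(n);
    gsl_vector *step = gsl_vector_alloc(n);
    gsl_multimin_function f = {&quad, n, nullptr};

    /* ...then reset and reuse it across many problems of the same size. */
    for (int run = 0; run < 1000; ++run) {
        gsl_vector_set_all(x0, 0.0);
        gsl_vector_set_all(step, 0.5);
        gsl_multimin_fminimizer_set(s, &f, x0, step); /* no fresh mallocs */

        int status = GSL_CONTINUE;
        for (int iter = 0; iter < 200 && status == GSL_CONTINUE; ++iter) {
            if (gsl_multimin_fminimizer_iterate(s) != GSL_SUCCESS)
                break;
            status = gsl_multimin_test_size(
                gsl_multimin_fminimizer_size(s), 1e-8);
        }
    }
    std::printf("last minimum: %g\n", gsl_multimin_fminimizer_minimum(s));

    gsl_vector_free(step);
    gsl_vector_free(x0);
    gsl_multimin_fminimizer_free(s);
    return 0;
}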

I tried this technique with the ODE solvers in the GSL and it only gave me
about a 5% overall performance improvement, so I dropped it from my code;
it was a fiddle to maintain and the user would barely notice the
difference. I was doing quite a lot of other things per step, though, so
if your overall time is dominated by malloc/free it may help more.

Not sure whether it would be worth trying a different memory allocator.
There used to be faster ones available, but they may well have been folded
into the standard libraries by now.

Hope some of that is useful for you!

  Sam



Re: [Help-gsl] gsl performance

2013-10-07 Thread onefire
> I tried this technique with the ODE solvers in the GSL and it only gave
> me about a 5% overall performance improvement, so I dropped it from my
> code; it was a fiddle to maintain and the user would barely notice the
> difference. I was doing quite a lot of other things per step, though, so
> if your overall time is dominated by malloc/free it may help more.

I am not surprised by your results because, contrary to what my previous
messages might suggest, I think the main problem is not the allocations
themselves but memory locality: at least for certain problems, the machine
is just much more efficient at accessing stack memory.

Unfortunately, it seems that it is not trivial to implement the init
functions that I suggested previously. This is because the minimizer has
to accept different types of objects depending on the problem. The library
currently uses void pointers, but that does not work if you need the
objects to be known at compile time. Here the lack of overloading in C
really hurts: one would need many more names for types and functions,
which could have a deep impact on the API.
G


[Help-gsl] gsl performance

2013-10-06 Thread onefire
Hi all,

I am creating a personal library for my C++ projects, and to evaluate its
performance I decided to compare it to gsl. To my surprise, my library is
much faster than gsl in many cases. For example, my Nelder-Mead
implementation can be 5 or 10 times faster for some functions.

This has been bothering me a lot because:
1) My tests are correct, and I checked the results against output from
Scipy and Octave.
2) I am certainly not smarter than the many people who worked on gsl over
the years.
3) I think I know how to use gsl properly.

Sorry I do not have an example, but my library is not polished enough for
me to publish it yet.

What I am really intrigued about is that I googled around and it seems
that many people noticed (before me) that many gsl implementations are
inefficient. I also found some posts on the mailing lists with claims like
"gsl is much slower than Matlab" and responses like "gsl is not meant to
be the fastest".

These things lead me to believe that gsl is "slow by design". If this is
the case, can someone explain to me why?

Gilberto

PS: Much of the gain with C++ comes from better inlining, which can be
especially important with things like function optimization (it is hard,
if not impossible, to inline when there is a function pointer passed
around). This is why std::sort can be much faster than C's qsort. However,
I am confident that it is possible to write faster implementations (in C)
than some of the ones in gsl.


Re: [Help-gsl] gsl performance

2013-10-06 Thread Simon Zehnder
Hi Gilberto,

congrats on your performance results! 

My first guess at the cause of the gsl library's worse performance would
be its error checking and reporting. The GSL is made for 'users', which
means a user has to be informed about the kind of error he encounters. You
can leave this out in your own code, because if something goes wrong you
know where to look, where the error occurred, and what the likely reasons
are. A 'user' merely calling a library could be helpless when he
encounters an error and does not know where it occurred. So checking and
error handling could be a reason that makes the gsl less performant but
more appropriate for 'easy' usage.

I use the C++ linear algebra library Armadillo a lot. It has, for example,
two different element accessors: '()', which performs bounds checking, and
'at()', which does not. Performance sometimes increases tremendously when
using the 'at()' method.
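
A small sketch of the difference (the vector contents are arbitrary):

#include <armadillo>

int main() {
    arma::vec v(5, arma::fill::randu);
    double a = v(2);     // bounds-checked element access
    double b = v.at(2);  // unchecked access, faster in tight loops
    // Defining ARMA_NO_DEBUG before including <armadillo> strips the
    // checks from operator() as well.
    return (a == b) ? 0 : 1;
}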

Best

Simon



Re: [Help-gsl] gsl performance

2013-10-06 Thread onefire
I am aware of those. I tried the following (a combined sketch follows the
list):
1) Call gsl_set_error_handler_off() to turn off the error handler
2) Compile with -DHAVE_INLINE and -DGSL_RANGE_CHECK_OFF
3) Link to a different cblas (I actually tried openblas, mkl and gslcblas).
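
For reference, a minimal sketch combining the three settings; the compile
line is illustrative and library names vary by system:

// Build, assuming g++ and OpenBLAS:
//   g++ -O2 -DHAVE_INLINE -DGSL_RANGE_CHECK_OFF prog.cpp -lgsl -lopenblas
#include <gsl/gsl_errno.h>
#include <gsl/gsl_vector.h>

int main() {
    gsl_set_error_handler_off();  /* 1) no error handler               */
    /* 2) HAVE_INLINE and GSL_RANGE_CHECK_OFF act inside the headers   */
    /* 3) linking -lopenblas instead of -lgslcblas swaps the BLAS      */
    gsl_vector *v = gsl_vector_alloc(3);
    gsl_vector_set(v, 0, 1.0);
    gsl_vector_free(v);
    return 0;
}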

However, my most interesting experiment was to modify my minimization
algorithms to accept dynamic arrays. Originally, my function was using
std::array to store the values of the independent variables when
minimizing a multidimensional function. My experiment mimicked my use
case: solve the same problem millions of times (to be precise, a million
times). When I used dynamic arrays, my algorithm became much slower
because of all the calls to malloc/new and free/delete. So I modified my
algorithm to accept a gsl_vector. Guess what happened? My library became
slower than gsl! I am not sure if this is because my Nelder-Mead
implementation is different from gsl's (which I did not look at) or if
it's just the extra overhead of the wrapper.

My point is: the huge differences that I observed were because of
gsl_vector, which is not efficient for small arrays (stack memory is
faster and you avoid the costs of malloc/free). So I think the gsl
minimization algorithms could have a version that uses stack arrays. I
should not have to pay the cost of dynamic allocation if I am minimizing a
function of 5 variables: a typical Nelder-Mead implementation needs 6
simplex points of 5 coordinates each, plus the 6 function values, i.e. an
array of 36 doubles, which is fine for the stack. I think gsl could have
alternative implementations that use stack memory (with warnings not to
use them if the problem is big).

The other issue is that the implementation of gsl_vector just seems
inefficient to me. Looking at the code, a single vector requires 3 calls
to malloc and free (one for the data, one for the gsl_block, and one for
the gsl_vector itself). The manual states that the block is there for
consistency, and I can see how memory management becomes easier with it,
but it seems to be a case of generality at the expense of performance.
Also, the stride and the owner flag are part of the gsl_vector object to
make it work with views, but then people who never need views pay the
performance price anyway.
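
For what it's worth, user-owned data can dodge those three mallocs by
wrapping a stack buffer in a view; a minimal sketch (this only helps for
vectors you create yourself, not the ones the routines allocate
internally):

#include <gsl/gsl_vector.h>
#include <cstdio>

int main() {
    /* The view's .vector member aliases buf and owns nothing, so no heap
       allocation happens here at all. */
    double buf[5] = {1, 2, 3, 4, 5};
    gsl_vector_view v = gsl_vector_view_array(buf, 5);
    std::printf("%g\n", gsl_vector_get(&v.vector, 0));
    return 0;
}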

All of these issues are not likely to matter for big vectors, but they do
make a large difference when you are dealing with small objects.

Am I the only one who thinks like this?

Gilberto

PS: As a side note, I prefer Eigen over Armadillo. In my experience, the
former is almost always faster.



Re: [Help-gsl] gsl performance

2013-10-06 Thread onefire
Rhys, I did try to use views. They do not help because the gsl routines
allocate vectors internally and there is not much that I can do about it...
except for maybe hacking gsl and changing gsl_vector_alloc myself.

> The main target of a library like GSL is an inherent consistency of its
> objects and as it defines the gsl_vector and uses it everywhere in its
> methods, consistency is reached.

That's what I thought too, and it squares with my initial question about
gsl's design not prioritizing performance. I had heard about nlopt before
and I plan to benchmark my results against it as well. I started with gsl
because I am more familiar with it.



On Sun, Oct 6, 2013 at 4:38 PM, Simon Zehnder
simon.zehn...@googlemail.com wrote:

 I write between the lines.

 On Oct 6, 2013, at 10:04 PM, onefire onefire.mys...@gmail.com wrote:

  I am aware of those. I tried the following:
  1) Call gsl_set_error_handler_off() to turn off the error handler
  2) Compile with -DHAVE_INLINE  and -DGSL_RANGE_CHECK_OFF
  3) Link to a different cblas (actually I tried openblas, mkl and
 gslcblas).
 
  However, my most interesting experiment was to modify my minimization
 algorithms to accept dynamic arrays. Originally, my function was using
 std::array to store the values of the independent variables when minimizing
 a multidimensional function. My experiment did mimic my use case: solve the
 same problem millions of times (to be precise, a million times). When I
 used dynamic arrays, my algorithm became much slower because of all the
 calls to malloc/new, and free/delete. So I did modify my algorithm to
 accept a gsl_vector. Guess what happened? My library became slower than
 gsl! I am not sure if this is because my Nelder-Mead implementation is
 different than gsl's (which I did not look at) or if its just the extra
 overhead of making the wrapper.
 
 Here profiling could give clearer insights - if it's possible - about
 which gsl functions take the most time.
  My point is: The huge differences that I observed were because of
 gsl_vector which is not efficient for small arrays (stack memory is faster
 and you avoid the costs to malloc/free). So I think that the gsl
 minimization algorithms could have a version that uses stack arrays. I
 should not have to pay for the cost of dynamic allocation if I am
 minimizing a function of 5 variables (a typical version of the Nelder-Mead
 algorithm would require 6 points, including the function evaluations that
 would require an array of 36 elements, which is fine for the stack) . I
 think that gsl could have alternative implementations that use stack memory
 (with warnings to not use those if the problem is big).
 
 Interesting! Indeed, in most statistical problems I am involved in,
 around 6 parameters is also a good average; physicists and engineers may
 have larger parameter vectors. The main target of a library like GSL is
 an inherent consistency of its objects, and as it defines the gsl_vector
 and uses it everywhere in its methods, consistency is reached.
 Performance, I would guess, is not its main target. Providing a function
 that works with STL objects - commonly used by C++ developers - is not a
 bad idea though, and, as you say, it improves performance. Embedding GSL
 into self-developed code especially becomes easier with such a solution,
 as the step from one's own code to GSL methods would be smoother. For
 another performance benchmark you could try the Nelder-Mead from nlopt,
 which works with std::vector.

  The other issue is: the implementation of gsl_vector just seems
 inefficient to me. Looking at the code, it seems like a single vector
 requires 3 calls to malloc and free (one for the data, one for the
 gsl_block, and one for the gsl_vector itself). The manual states that the
 block is there for consistency, and I can see how memory management
 becomes easier with it. But it seems to be a case of generality at the
 expense of performance. Also, the stride and the owner flag are part of the
 gsl_vector object to make it work with gsl_views, but then people who never
 need views pay the performance price anyway.
 
 Again, I would assume the main development target was inherent consistency.

  All of these issues are not likely to matter for big vectors, but they
 do make a large difference when you are dealing with small objects.
 
  Am I the only one who thinks like this?
 
  Gilberto
 
  PS: As a side note, I prefer Eigen over Armadillo. In my experience, the
 former is almost always faster.
 
 Thanks for the suggestion; I had heard about it but never tried it. I
 will give it a shot.
 

Re: [Help-gsl] gsl performance

2013-10-06 Thread Rhys Ulerich
 Rhys, I did try to use views. They do not help because the gsl routines
 allocate vectors internally and there is not much that I can do about it...
 except for maybe hacking gsl and changing gsl_vector_alloc myself.

If from hacking around you happen to restructure the API so that a
clean workspace can be allocated for a given problem size and then
passed in to avoid the problematic internal allocations, please pass
along a patch. There's a lot of precedent in the library for having
low-level compute kernels wrapped by convenience methods, and
refactoring a kernel to match that pattern would be most welcome
provided the existing APIs remain unbroken.

- Rhys
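
As an illustration of the workspace pattern Rhys describes, here is a
minimal sketch using the integration routines, where the caller allocates
a workspace for the problem size once and passes it in; the integrand and
tolerances are arbitrary:

#include <gsl/gsl_integration.h>
#include <cmath>
#include <cstdio>

static double gaussian(double x, void *params) {
    (void)params;
    return std::exp(-x * x);
}

int main() {
    /* The caller owns the workspace: allocate once for a given maximum
       number of subintervals, reuse across many integrations, free once.
       The routine itself does not allocate internally. */
    gsl_integration_workspace *w = gsl_integration_workspace_alloc(1000);
    gsl_function F = {&gaussian, nullptr};
    double result, error;
    gsl_integration_qags(&F, 0.0, 1.0, 0.0, 1e-9, 1000, w, &result, &error);
    std::printf("integral = %.10f +/- %g\n", result, error);
    gsl_integration_workspace_free(w);
    return 0;
}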