[Help-gsl] gsl performance

2013-10-06 Thread onefire
Hi all,

I am creating a personal library for my C++ projects and to evaluate its
performance I decided to compare it to gsl and, to my surprise, my library
is much faster than gsl in many cases. For example, my Nelder-Mead
implementation can be 5 or 10 times faster for some functions.

This has been bothering me a lot because:
1) My tests are correct, and I checked the results against output from
Scipy and Octave.
2) I am certainly not smarter than the many people who worked on gsl over
the years.
3) I think I know how to use gsl properly.

Sorry I do not have an example, but my library is not polished enough for
me to publish it yet.

What really intrigues me is that I googled around and it seems that many
people noticed (before me) that many gsl implementations are
inefficient. I also found posts on the mailing lists saying things like
"gsl is much slower than Matlab" and responses like "gsl is not meant to
be the fastest".

These things lead me to believe that gsl is slow by design. If this is
the case, can someone explain to me why this is so?

Gilberto

PS: Much of the gain with C++ comes from better inlining, which can be
especially important with things like function optimization (it is hard,
if not impossible, to inline when a function pointer is passed around).
This is why std::sort can be much faster than C's qsort. However, I am
confident that it is possible to write faster implementations (in C) than
some of the ones in gsl.


Re: [Help-gsl] gsl performance

2013-10-06 Thread Simon Zehnder
Hi Gilberto,

congrats on your performance results! 

My first guess for the gsl's worse performance would be error checking 
and exception handling. The GSL is made for 'users', and this means that a 
user has to be informed about the kind of error he encounters. This can be 
left out if you have your own code and know it well: if something goes 
wrong, you know where to look and what the likely reasons are. A 'user' who 
is just using a library could be helpless if he encounters an error and 
does not know where it occurred. So checking and error catching could be a 
reason that makes the gsl less performant but more appropriate for 'easy' 
usage. 

I use the C++ linear algebra library Armadillo a lot. That library has, for 
example, two different element accessors: '()', which performs bounds 
checking, and 'at()', which does not. Performance sometimes increases 
tremendously when using the 'at()' method. 

Best

Simon


On Oct 6, 2013, at 12:44 AM, onefire onefire.mys...@gmail.com wrote:

 [full quote of the original message trimmed]




Re: [Help-gsl] gsl performance

2013-10-06 Thread onefire
I am aware of those. I tried the following:
1) Call gsl_set_error_handler_off() to turn off the error handler
2) Compile with -DHAVE_INLINE  and -DGSL_RANGE_CHECK_OFF
3) Link to a different cblas (actually I tried openblas, mkl and gslcblas).
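
For reference, a compile line combining steps 2 and 3 might look like this
(file and library names here are assumptions; adjust for your BLAS of
choice):

```shell
# Inline small kernels, disable range checks, and link OpenBLAS
# instead of the reference gslcblas:
gcc -O2 -DHAVE_INLINE -DGSL_RANGE_CHECK_OFF bench.c -lgsl -lopenblas -lm
```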

However, my most interesting experiment was to modify my minimization
algorithms to accept dynamic arrays. Originally, my function used
std::array to store the values of the independent variables when
minimizing a multidimensional function. My experiment mimicked my use
case: solve the same problem a million times. When I used dynamic arrays,
my algorithm became much slower because of all the calls to malloc/new
and free/delete. So I modified my algorithm to accept a gsl_vector.
Guess what happened? My library became slower than gsl! I am not sure if
this is because my Nelder-Mead implementation is different from gsl's
(which I did not look at) or if it's just the extra overhead of the
wrapper.

My point is: the huge differences that I observed were because of
gsl_vector, which is not efficient for small arrays (stack memory is
faster, and you avoid the cost of malloc/free). So I think the gsl
minimization algorithms could have versions that use stack arrays. I
should not have to pay the cost of dynamic allocation if I am minimizing
a function of 5 variables: a typical Nelder-Mead implementation keeps 6
simplex points of 5 coordinates each, plus their 6 function values, which
is 36 doubles in total and fits comfortably on the stack. I think gsl
could offer alternative implementations that use stack memory (with
warnings not to use them if the problem is big).

The other issue is: the implementation of gsl_vector just seems inefficient
to me. Looking at the code, it seems like a single vector requires 3 calls
to malloc and free (one for the data, one for the gsl_block, and one for
the gsl_vector itself). The manual states that the block is there for
consistency, and I can see how memory management becomes easier with it.
But it seems to be a case of generality at the expense of performance.
Also, the stride and the owner flag are part of the gsl_vector object to
make it work with gsl_views, but then people who never need views pay the
performance price anyway.

All of these issues are not likely to matter for big vectors, but they do
make a large difference when you are dealing with small objects.

Am I the only one who thinks like this?

Gilberto

PS: As a side note, I prefer Eigen over Armadillo. In my experience, the
former is almost always faster.


On Sun, Oct 6, 2013 at 5:09 AM, Simon Zehnder
simon.zehn...@googlemail.com wrote:

 [full quote of the previous messages trimmed]

Re: [Help-gsl] gsl performance

2013-10-06 Thread onefire
Rhys, I did try to use views. They do not help because the gsl routines
allocate vectors internally and there is not much that I can do about it...
except for maybe hacking gsl and changing gsl_vector_alloc myself.

 The main target of a library like GSL is an inherent consistency of its
 objects and as it defines the gsl_vector and uses it everywhere in its
 methods, consistency is reached.

That's what I thought too, and it fits my initial question about gsl's
design not prioritizing performance.
I heard about nlopt before and I plan to benchmark my results against it
as well. I started with gsl because I am more familiar with it.



On Sun, Oct 6, 2013 at 4:38 PM, Simon Zehnder
simon.zehn...@googlemail.com wrote:

 I write between the lines.

 On Oct 6, 2013, at 10:04 PM, onefire onefire.mys...@gmail.com wrote:

  I am aware of those. I tried the following:
  1) Call gsl_set_error_handler_off() to turn off the error handler
  2) Compile with -DHAVE_INLINE  and -DGSL_RANGE_CHECK_OFF
  3) Link to a different cblas (actually I tried openblas, mkl and
 gslcblas).
 
  However, my most interesting experiment was to modify my minimization
 algorithms to accept dynamic arrays. Originally, my function was using
 std::array to store the values of the independent variables when minimizing
 a multidimensional function. My experiment did mimic my use case: solve the
 same problem millions of times (to be precise, a million times). When I
 used dynamic arrays, my algorithm became much slower because of all the
 calls to malloc/new, and free/delete. So I did modify my algorithm to
 accept a gsl_vector. Guess what happened? My library became slower than
 gsl! I am not sure if this is because my Nelder-Mead implementation is
 different than gsl's (which I did not look at) or if its just the extra
 overhead of making the wrapper.
 
 Here profiling could give clearer insights - if it's possible - into
 which gsl functions take the most time.
  My point is: The huge differences that I observed were because of
 gsl_vector which is not efficient for small arrays (stack memory is faster
 and you avoid the costs to malloc/free). So I think that the gsl
 minimization algorithms could have a version that uses stack arrays. I
 should not have to pay for the cost of dynamic allocation if I am
 minimizing a function of 5 variables (a typical version of the Nelder-Mead
 algorithm would require 6 points, including the function evaluations that
 would require an array of 36 elements, which is fine for the stack) . I
 think that gsl could have alternative implementations that use stack memory
 (with warnings to not use those if the problem is big).
 
 Interesting! Indeed, in most statistical problems I am involved in, around
 6 parameters is also a good average; physicists and engineers may have
 larger parameter vectors. The main target of a library like GSL is an
 inherent consistency of its objects, and as it defines the gsl_vector and
 uses it everywhere in its methods, consistency is reached. Performance, I
 would guess, is not its main target. Providing functions that work with
 STL objects - commonly used by C++ developers - is not a bad idea though,
 and, as you report, it improves performance. Especially embedding GSL into
 self-developed code becomes easier with such a solution, as the step from
 one's own code to GSL methods would be more 'fluent'. For another
 performance benchmark you could try the Nelder-Mead from nlopt. This works
 with std::vector.

  The other issue is: the implementation of gsl_vector just seems
 inefficient to me. Looking at the code, it seems like a single vector
 requires 3 calls to malloc and free (one for the data, one for the
 gsl_block, and one for the gsl_vector itself). The manual states that the
 block is there for consistency, and I can see how memory management
 becomes easier with it. But it seems to be a case of generality at the
 expense of performance. Also, the stride and the owner flag are part of the
 gsl_vector object to make it work with gsl_views, but then people who never
 need views pay the performance price anyway.
 
 Again, I would assume the main developing target was inherent consistency.

  All of these issues are not likely to matter for big vectors, but they
 do make a large difference when you are dealing with small objects.
 
  Am I the only one who thinks like this?
 
  Gilberto
 
  PS: As a side note, I prefer Eigen over Armadillo. In my experience, the
 former is almost always faster.
 
 Thanks for the suggestion, I heard about it but never tried it. I will
 give it a shot.
 
  [remainder of quoted message trimmed]

[Help-gsl] Possible leak in gsl_multifit_fdfsolver.

2013-10-06 Thread Eduardo Novaes Hering
Hi.

I've been struggling with this problem for a few weeks
now, but I do not have enough experience with debugging
to be sure about what is written below.

I'll quote some simple code, which I believe you know
well:

 51:  s = gsl_multifit_fdfsolver_alloc (T, n, p);
 52:  gsl_multifit_fdfsolver_set (s, &f, &x.vector);
 
 56:  do {
 57:     iter++;
 58:     status = gsl_multifit_fdfsolver_iterate (s);
 
 64:     if (status)
 65:       break;
 
 67:     status = gsl_multifit_test_delta (s->dx, s->x,
 68:                                       1e-3, 1e-3);
 69:  }
 70:  while (status == GSL_CONTINUE && iter < 500);
 
 72:  gsl_multifit_covar (s->J, 0.0, covar);
 
 77:  double chi = gsl_blas_dnrm2(s->f);
 
 89:  for (int nN = 0; nN < nNumParameters; nN++) {
 90:    padParameters[nN] = gsl_vector_get(s->x, nN);
 91:  }
 
 93:  gsl_multifit_fdfsolver_free (s);
 94:  gsl_matrix_free (covar);


Where, from gdb:

 (gdb) p s
 $68 = (gsl_multifit_fdfsolver *) 0x10080f270
 (gdb) p *s
 $69 = {
   type = 0x1001b6de0, 
   fdf = 0x7fff5fbfdbc0, 
   x = 0x100810480, 
   f = 0x10080f6c0, 
   J = 0x10080f740, 
   dx = 0x100810a00, 
   state = 0x100810cb0
 }
 (gdb) whatis s->state
 type = void *


Now, if I step through this code, keeping an eye on 
address 0x100810cc0:

 51  s = gsl_multifit_fdfsolver_alloc (T, n, p);
 (gdb) x 0x100810cc0
 0x100810cc0:  0x
 (gdb) s
 52  gsl_multifit_fdfsolver_set (s, &f, &x.vector);
 (gdb) x 0x100810cc0
 0x100810cc0:  0x
 (gdb) s
 57 iter++;
 (gdb) x 0x100810cc0
 0x100810cc0:  0x052e3967
 (gdb) s
 58 status = gsl_multifit_fdfsolver_iterate (s);
 (gdb) x 0x100810cc0
 0x100810cc0:  0x052e3967
 (gdb) s 
 64 if (status)
 (gdb) x 0x100810cc0
 0x100810cc0:  0xa697930c
 (gdb) s
 67 status = gsl_multifit_test_delta (s->dx, s->x,
 (gdb) x 0x100810cc0
 0x100810cc0:  0xa697930c
 (gdb) s
 69  }
 (gdb) x 0x100810cc0
 0x100810cc0:  0xa697930c
 (gdb) s
 57 iter++;
 (gdb) x 0x100810cc0
 0x100810cc0:  0xa697930c
 (gdb) s
 58 status = gsl_multifit_fdfsolver_iterate (s);
 (gdb) x 0x100810cc0
 0x100810cc0:  0xa697930c
 (gdb) s
 64 if (status)
 (gdb) x 0x100810cc0
 0x100810cc0:  0x2d9931ef
 (gdb) s
 67 status = gsl_multifit_test_delta (s->dx, s->x,
 (gdb) b 92
 Breakpoint 11 at 0x1b774: file Fit.cpp, line 92.
 (gdb) c
 Continuing.

 ...

 Breakpoint 11, Fit::FitData (this=0x100810120, cXData=0x7fff5fbfded8, 
 cYData=0x7fff5fbfdea0, cYStdDevData=0x7fff5fbfde68, dStdDev=0x7fff5fbfdf60, 
 padParameters=0x100810130, nNumParameters=6) at Fit.cpp:93
 93  gsl_multifit_fdfsolver_free (s);
 (gdb) x 0x100810cc0
 0x100810cc0:  0x0fb668f4
 (gdb) s
 94  gsl_matrix_free (covar);
 (gdb) x 0x100810cc0
 0x100810cc0:  0x0fb600bb
 (gdb) s
 96  *dStdDev = chi;
 (gdb) x 0x100810cc0
 0x100810cc0:  0x0fb600bb

This is a completely reproducible error. The address
0x100810cc0 holds a null value in all other areas
of the code. It takes a non-null value only after
line 52 and changes its value every time status is
accessed.

In the end, when the gsl_multifit_fdfsolver should
be freed, some trash is left at that address and
I get the following intermittent, extremely annoying
error after I try to leave my data processing functions:

  malloc: *** error for object 0x100810cc0: incorrect checksum for freed 
 object - object was probably modified after being freed.
 *** set a breakpoint in malloc_error_break to debug
 
 Program received signal SIGABRT, Aborted.
 0x7fff88c1ad46 in __kill ()


It took me a long time to find what was happening on
that address because I wrapped the GSL code in a
class which seems to cleanup everything after its
destructor is called and, when the malloc error
happens, that address has a null value again.
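
A standard way to catch whatever is scribbling on that freed block is a
memcheck run (command sketch; `./fit_program` stands in for the actual
binary):

```shell
# The first "Invalid write" valgrind reports names the code that writes
# past the end of (or after freeing) the solver's allocation, which is
# usually more actionable than malloc's checksum abort.
valgrind --tool=memcheck --track-origins=yes ./fit_program
```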

I can send the complete code to whoever can help
me to understand what is going on.

Thanks a lot for your attention,

Eduardo.


Eduardo Novaes Hering, PhD (MSc, DSc).
__
e.n.her...@icloud.com
eduardo.her...@cea.fr
enher...@gmail.com
___

INAC/SPSMS/IMAPEC, CEA Grenoble.
17 rue des Martyrs, 38054 
Grenoble cedex 9. France.
___

Work:   (+33)or(0) 4 58 78 35 03
Mobile: (+33)or(0) 6 59 70 60 12
Fax :   (+33)or(0) 4 38 78 50 96
Home:   (+33)or(0) 9 66 84 10 40
Skype: enhering
Twitter: @enhering
http://www.gabuleu.com/enhering
___

Stop buying from China 
 until Tibet is free!
___

 A difference 
which makes no difference
  is not a difference.

Mr. Spock (stardate 2822.3)






Re: [Help-gsl] gsl performance

2013-10-06 Thread Rhys Ulerich
 Rhys, I did try to use views. They do not help because the gsl routines
 allocate vectors internally and there is not much that I can do about it...
 except for maybe hacking gsl and changing gsl_vector_alloc myself.

If from hacking around you happen to restructure the API so that a
clean workspace can be allocated for a given problem size and then
passed in to avoid the problematic internal allocations, please pass
along a patch. There's a lot of precedent in the library for having
low-level compute kernels wrapped by convenience methods, and
refactoring a kernel to match that pattern would be most welcome
provided the existing APIs remain unbroken.

- Rhys