[sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Volker Braun
Did you profile your code on the C-level? e.g. using gprof? As a rule of 
thumb, guesses about where the bottleneck is are wrong :-)  Its entirely 
conceivable that branch prediction and speculative execution solve this 
already for you.

C++ std::sort will be able to inline the comparator.

Link-time optimization (e.g. gcc -flto) can in principle also inline on the 
level of object code, after the compilation did not inline (because of 
different compilation units, say).Though the libc sort is in a shared 
library.


On Sunday, November 23, 2014 4:33:46 PM UTC, Nathann Cohen wrote:

 Hello everybody,

 I wrote a bruteforce Cython code recently (#17309) which spends most of 
 its time on calls to qsort.

 This is normal, sorting is sort of the most expensive thing I do, but to 
 call qsort you need to provide a comparison function. Now, as qsort is 
 compiled in a library, the comparison function cannot be inlined inside of 
 qsort and I suspect that it has a nontrivial cost (given how simple the 
 comparison is).

 Thus, if I copy/paste the original code of qsort into my Cython file the 
 code should be faster, only that is ugly.

 Sooo... Do you know if there ais  way to re-compile my code along with 
 the code of qsort without having to copy/paste it ?

 Thanks,

 Nathann


-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Nathann Cohen
Yo !

 Did you profile your code on the C-level? e.g. using gprof? As a rule of 
 thumb, guesses about where the bottleneck is are wrong :-)  Its entirely 
 conceivable that branch prediction and speculative execution solve this 
 already for you.

Yeees daddy :-P

I did that like a grown man with perf record :-P

 C++ std::sort will be able to inline the comparator.

Oh cool ! Thanks I will try to call that from Cython.

 Link-time optimization (e.g. gcc -flto) can in principle also inline on the 
 level of object code, after the compilation did not inline (because of 
 different compilation units, say).Though the libc sort is in a shared library.

Oh... So when does that work ? I mean, if you compile everything from
source already there is nothing to optimize at link time, is there ?

Nathann

-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Thierry Dumont

Le 23/11/2014 18:07, Volker Braun a écrit :

Did you profile your code on the C-level? e.g. using gprof? As a rule of
thumb, guesses about where the bottleneck is are wrong :-)  Its entirely
conceivable that branch prediction and speculative execution solve this
already for you.



Is gprof enough powerful with modern architectures on such programs? 
from my point of view, no.
There are non free, commercial, tools like vtune which can do fantastic 
measurement job. Vtune shows, for example, that a call to std::copy is 
not as fast as a for loop, which is turned by the compiler in a memcopy 
(probably std::copy is not!). I do not think we can see this with gprof.

But ok, you are not supposed to buy vtune...

What about likwid https://code.google.com/p/likwid ? It is free. Did 
somebody used it to measure cython code performances?


Likwid (and Vtune) have in common to use  performance counters on Intel 
and AMD processors (not sure for AMD with Vtune...).


What is the size of what you are sorting ? If it is small enough to fit 
in the caches, and better in the L1 cache, you can possibly improve 
something with your modification, but otherwise it is certainly memory 
bounded and you cannot do much...
You have to measure the bandwidth of your program. Vtune does this, 
possibly likwid too.


t.d.

C++ std::sort will be able to inline the comparator.

Link-time optimization (e.g. gcc -flto) can in principle also inline on
the level of object code, after the compilation did not inline (because
of different compilation units, say).Though the libc sort is in a shared
library.


On Sunday, November 23, 2014 4:33:46 PM UTC, Nathann Cohen wrote:

Hello everybody,

I wrote a bruteforce Cython code recently (#17309) which spends most
of its time on calls to qsort.

This is normal, sorting is sort of the most expensive thing I do,
but to call qsort you need to provide a comparison function. Now, as
qsort is compiled in a library, the comparison function cannot be
inlined inside of qsort and I suspect that it has a nontrivial cost
(given how simple the comparison is).

Thus, if I copy/paste the original code of qsort into my Cython file
the code should be faster, only that is ugly.

Sooo... Do you know if there ais  way to re-compile my code
along with the code of qsort without having to copy/paste it ?

Thanks,

Nathann

--
You received this message because you are subscribed to the Google
Groups sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send
an email to sage-devel+unsubscr...@googlegroups.com
mailto:sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com
mailto:sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


--
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.
attachment: tdumont.vcf

Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Francesco Biscani
On 23 November 2014 at 18:07, Volker Braun vbraun.n...@gmail.com wrote:

 C++ std::sort will be able to inline the comparator.


+1

std::sort() will do exactly what you describe, only in a type-safe and
compiler-checked automatic way.

-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Nathann Cohen
Hello !

 What about likwid https://code.google.com/p/likwid ? It is free. Did
 somebody used it to measure cython code performances?

Never tried vtune, nor likwid.

 What is the size of what you are sorting ? If it is small enough to fit in
 the caches, and better in the L1 cache, you can possibly improve something
 with your modification, but otherwise it is certainly memory bounded and you
 cannot do much...

I sort many small arrays. Several kB, not more.

 You have to measure the bandwidth of your program. Vtune does this, possibly
 likwid too.

Oh. Do you think that it can be used from outside the program, i.e. on
a running Sage session ?

Nathann

-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Francesco Biscani
On 23 November 2014 at 19:05, Thierry Dumont tdum...@math.univ-lyon1.fr
wrote:

 Is gprof enough powerful with modern architectures on such programs? from
 my point of view, no.
 There are non free, commercial, tools like vtune which can do fantastic
 measurement job. Vtune shows, for example, that a call to std::copy is not
 as fast as a for loop, which is turned by the compiler in a memcopy
 (probably std::copy is not!). I do not think we can see this with gprof.
 But ok, you are not supposed to buy vtune...


I would be surprised if any modern c++ library implementation does not have
specialisations of std::copy for POD types that use memcpy() or some other
trick.


 What about likwid https://code.google.com/p/likwid ? It is free. Did
 somebody used it to measure cython code performances?

 Likwid (and Vtune) have in common to use  performance counters on Intel
 and AMD processors (not sure for AMD with Vtune...).

 What is the size of what you are sorting ? If it is small enough to fit in
 the caches, and better in the L1 cache, you can possibly improve something
 with your modification, but otherwise it is certainly memory bounded and
 you cannot do much...
 You have to measure the bandwidth of your program. Vtune does this,
 possibly likwid too.


I used callgrind() in the past with some success... I would like to try the
google cpu profiler to see how it fares, but I haven't had the chance yet.

-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Christopher Swenson
There's a sort.h library you should be able to include that will have quick
sort + many others, and will allow the compiler to properly inline
functions. https://github.com/swenson/sort (disclaimer: I wrote it). I get
about a 10x speed improvement over qsort just for the ability to inline
functions.

I've never used it with Cython, but I don't believe there should be any
problems doing so.
On Nov 23, 2014 10:11 AM, Francesco Biscani bluesca...@gmail.com wrote:

 On 23 November 2014 at 19:05, Thierry Dumont tdum...@math.univ-lyon1.fr
 wrote:

 Is gprof enough powerful with modern architectures on such programs? from
 my point of view, no.
 There are non free, commercial, tools like vtune which can do fantastic
 measurement job. Vtune shows, for example, that a call to std::copy is not
 as fast as a for loop, which is turned by the compiler in a memcopy
 (probably std::copy is not!). I do not think we can see this with gprof.
 But ok, you are not supposed to buy vtune...


 I would be surprised if any modern c++ library implementation does not
 have specialisations of std::copy for POD types that use memcpy() or some
 other trick.


 What about likwid https://code.google.com/p/likwid ? It is free. Did
 somebody used it to measure cython code performances?

 Likwid (and Vtune) have in common to use  performance counters on Intel
 and AMD processors (not sure for AMD with Vtune...).

 What is the size of what you are sorting ? If it is small enough to fit
 in the caches, and better in the L1 cache, you can possibly improve
 something with your modification, but otherwise it is certainly memory
 bounded and you cannot do much...
 You have to measure the bandwidth of your program. Vtune does this,
 possibly likwid too.


 I used callgrind() in the past with some success... I would like to try
 the google cpu profiler to see how it fares, but I haven't had the chance
 yet.

 --
 You received this message because you are subscribed to the Google Groups
 sage-devel group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to sage-devel+unsubscr...@googlegroups.com.
 To post to this group, send email to sage-devel@googlegroups.com.
 Visit this group at http://groups.google.com/group/sage-devel.
 For more options, visit https://groups.google.com/d/optout.


-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Thierry Dumont

Le 23/11/2014 19:07, Francesco Biscani a écrit :

On 23 November 2014 at 18:07, Volker Braun vbraun.n...@gmail.com
mailto:vbraun.n...@gmail.com wrote:

C++ std::sort will be able to inline the comparator.



Just look at the assembly code:-)


+1

std::sort() will do exactly what you describe, only in a type-safe and
compiler-checked automatic way.

--
You received this message because you are subscribed to the Google
Groups sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send
an email to sage-devel+unsubscr...@googlegroups.com
mailto:sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com
mailto:sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


--
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.
attachment: tdumont.vcf

Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Thierry Dumont

Le 23/11/2014 19:09, Nathann Cohen a écrit :

Hello !


What about likwid https://code.google.com/p/likwid ? It is free. Did
somebody used it to measure cython code performances?


Never tried vtune, nor likwid.


What is the size of what you are sorting ? If it is small enough to fit in
the caches, and better in the L1 cache, you can possibly improve something
with your modification, but otherwise it is certainly memory bounded and you
cannot do much...


I sort many small arrays. Several kB, not more.


You have to measure the bandwidth of your program. Vtune does this, possibly
likwid too.


Oh. Do you think that it can be used from outside the program, i.e. on
a running Sage session ?


With Vtune it is possible (you start and stop collection with a call to 
the library (do not imagine I am paid by Intel :-) ,I am not.). With 
likwid, I don t know.
But, ok, it seems your objects can stay in the L1 cache. In that case, 
there is a possibility to improve performances by doing what you propose 
to do. I had recently a problem  like yours: an ode solver needs to 
solve linear systems, and I got a slight improvement by replacing the 
call to lapack by my own routine, probably because the lapack routine I 
called (which is called at many places in the program) imposes a far 
branch at each call... and for small systems (size 5), it is penalizing.
But what about the quick sort? is it sure that the implementation cannot 
degenerate? it is well known all the efficiency can be lost if the key 
used for partition is not chosen  as it should be... What about 
replacing the quick sort by an other method ? (the tree based one?).


t.d.



Nathann



--
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.
attachment: tdumont.vcf

Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Jeroen Demeyer

On 2014-11-23 19:05, Thierry Dumont wrote:

Vtune shows, for example, that a call to std::copy is
not as fast as a for loop, which is turned by the compiler in a memcopy
(probably std::copy is not!).

If that's the case, get a better C++ compiler.

--
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Francesco Biscani
On 23 November 2014 at 20:41, Thierry Dumont tdum...@math.univ-lyon1.fr
wrote:

 But what about the quick sort? is it sure that the implementation cannot
 degenerate? it is well known all the efficiency can be lost if the key
 used for partition is not chosen  as it should be... What about replacing
 the quick sort by an other method ? (the tree based one?).


qsort is not necessarily quick sort. AFAIK, the C/C++ libraries only
guarantee that the sorting has logarithmic complexity in the number of
comparisons.

std::sort is often implemented as a mix of introsort and other sorting
algorithms, with special casing for small/big objects, small/large number
of items to be sorted, etc.

  F.

-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.


Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Thierry Dumont

Le 23/11/2014 20:53, Jeroen Demeyer a écrit :

On 2014-11-23 19:05, Thierry Dumont wrote:

Vtune shows, for example, that a call to std::copy is
not as fast as a for loop, which is turned by the compiler in a memcopy
(probably std::copy is not!).

If that's the case, get a better C++ compiler.


heum... it is icc :-)
A c++ compiler as some unpredicable aspects...

--
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.
attachment: tdumont.vcf

Re: [sage-devel] Re: Call qsort from cython code .... with an inlined comparison function ?!

2014-11-23 Thread Francesco Biscani
icc is a pretty garbage C++ compiler, unless you are doing exactly the type
of linear algebra operations that they optimise to death (on Intel
processors at least :) for benchmarketing.

On 23 November 2014 at 21:53, Thierry Dumont tdum...@math.univ-lyon1.fr
wrote:

 Le 23/11/2014 20:53, Jeroen Demeyer a écrit :

 On 2014-11-23 19:05, Thierry Dumont wrote:

 Vtune shows, for example, that a call to std::copy is
 not as fast as a for loop, which is turned by the compiler in a memcopy
 (probably std::copy is not!).

 If that's the case, get a better C++ compiler.

  heum... it is icc :-)
 A c++ compiler as some unpredicable aspects...


 --
 You received this message because you are subscribed to the Google Groups
 sage-devel group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to sage-devel+unsubscr...@googlegroups.com.
 To post to this group, send email to sage-devel@googlegroups.com.
 Visit this group at http://groups.google.com/group/sage-devel.
 For more options, visit https://groups.google.com/d/optout.


-- 
You received this message because you are subscribed to the Google Groups 
sage-devel group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at http://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.