Re: [python-uk] C is it faster than numpy

Michael Fri, 25 Feb 2022 17:34:17 -0800

Hi,

On Fri, 25 Feb 2022 at 16:22, BELAHCENE Abdelkader <
abdelkader.belahc...@enst.dz> wrote:

> Hi,
> What do you mean?
> the python and C programs are not equivalent?
>
>
The python and C programs are NOT equivalent. (surface level they are: they
both calculate the triangular number of N, N times, inefficiently and
slowly, but the way they do it is radically different)

Let's specifically compare the num.py and the C versions. The pure python
version *should* always be slower, so it's irrelevant here. (NB, I show a
version below where the pure python version is quicker than both :-) )

The num.py version:

* It creates an array containing the numbers 1 to n. It then call's
num.py's sum function with that array n times.

The C version :
* Calls a function n times. That function has a tight loop using a long
that sums all the numbers from 1 to n.

These are *very* different operations. The former can be vectorised, and
while I don't use num.py, and while I'm not a betting person I would bet a
Mars bar that the num.py version will throw the array at a SIMD
implementation. A cursory look at the source to numpy does indeed show that
this is the case (fx: grabs a mars bar :) ), and not only that it's
optimised for a wide variety of CPU architectures.
https://github.com/numpy/numpy/tree/main/numpy/core/src/common/simd

If you don't know what that means, take a look at this --
https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions numpy however
supports multiple versions of SIMD - including things like this:
https://en.wikipedia.org/wiki/AVX-512 - which is pretty neat.

Upshot - numpy will /essentially/ just throw the entire array at the CPU
and say "add these, give me the result". And it'll be done. (OK there's a
bit more to it, but that's the principle)

By contrast your naive C version has to repeatedly create stack frames,
push arguments onto it, pop arguments allocate memory on the stack, etc.
It's also single threaded, and has no means of telling the CPU "add all
these together, and I don't care how", so it literally is one after the
other. The actual work it's doing to get the same answer is simply much
greater and many many more clock cycles.

If you took the same approach as numpy it's *possible* you *might* get
something similarly fast. (But you'd have a steep learning curve and you'd
likely only optimise for one architecture...)

Aside: What size is the *value*  the C version is creating? Well it's just
(n * n+1 ) /2  - primary school triangular number stuff. So 50K
is 1250025000 - which is a 31 bit number. So the C version will fail on a
32 bit machine as soon as you try over 65536 as the argument... (On a
64bit machine admittedly the fall over number is admittedly higher... :-) )

C and C++ *are* faster than pure python. That's pretty much always going to
be the case ( *except* for a specialising python compiler that compiles a
subset of python to either C/C++/Rust or assembler). However, python +
highly optimised C/C++/etc libraries will likely outperform naive C/C++
code - as you've ably demonstrated here.

Why? Because while on the surface the two programs are vaguely similar -
calculate the triangular number of N, N times.  In practice, the way that
they do it is so different, you get dramatically different results.

As a bonus - you can get faster than both your num.py and C versions with
pure python, by several orders of magnitude. You can then squeeze out a few
more percent out of it.  Output from a pure python version of a piece of
code that performs the same operation - calculates the triangle number of
N, N times:

michael@conceptual:~$ ./triangles.py 15000000
time using pure python: 1.81 sec
time using pure python & memoisation: 1.7 sec

Note that's 15 million, not 50,000 - and it took the same time for my
machine (recent core i7) as numpy did for you on your machine for just
50,000. That's not because I have a faster machine (I expect I don't).

What's different? Well the pure python version is this - it recognises you
were calculating  the triangular number for N and just calculates that
instead :-)

def normaltriangle(n):
   return (n*(n+1))/2
... called like this ..
   tm1=it.timeit(stmt=lambda: normaltriangle(count), number=count)
   print(f"time using pure python: {round(tm1,2)} sec")

The (naive) memoisation version is this:

def memoise(f):
   cache = {}
   def g(n):
       if n in cache:
           return cache[n]
       else:
           X = f(n)
           cache[n] = X
           return X
   return g

@memoise
def triangle(n):
   return (n*(n+1))/2

... called like this ...
   tm2=it.timeit(stmt=lambda: triangle(count), number=count)
   print(f"time using pure python & memoisation: {round(tm2,2)} sec")

Original file containing both:

#!/usr/bin/python3

import timeit as it

def memoise(f):
   cache = {}
   def g(n):
       if n in cache:
           return cache[n]
       else:
           X = f(n)
           cache[n] = X
           return X
   return g

@memoise
def triangle(n):
   return (n*(n+1))/2

def normaltriangle(n):
   return (n*(n+1))/2

if __name__ == "__main__":
   import sys
   count = int(sys.argv[1])
   tm1=it.timeit(stmt=lambda: normaltriangle(count), number=count)
   print(f"time using pure python: {round(tm1,2)} sec")

   tm2=it.timeit(stmt=lambda: triangle(count), number=count)
   print(f"time using pure python & memoisation: {round(tm2,2)} sec")

I guess the point here is this: focussing on the language/tool C vs python
ignores how "well" the C code is written and how "well" the python code is
written.

However worrying about *that* rather than "is there a smarter way of doing
this" means you can sometimes miss the 300 fold speed up (vs numpy) or 6000
fold speed up (vs your python). I must admit as well, I was a little
surprised that the memoised version worked even this well with such simple
code.

But the point is when comparing C to numpy, you're really comparing *your*
C with *numpy's* C code, and as you can see from their code, your repeated
calls are very different from a vectorised call to CPU processor extensions
designed specifically for this sort of work...

Gets even sillier (in a fun way :) ) if you can throw the code at several
hundred GPU cores...

Hope this was vaguely interesting...

Michael.

On Fri, 25 Feb 2022 at 16:22, BELAHCENE Abdelkader <
abdelkader.belahc...@enst.dz> wrote:

> Hi,
> What do you mean?
> the python and C programs are not equivalent?
>
>
> Le ven. 25 févr. 2022 à 10:42, Giorgio Zoppi <giorgio.zo...@gmail.com> a
> écrit :
>
>> Well,
>> numpy is written in C :) Maybe your C is not the numpy equivalent?
>> Best Regards,
>> Giorgio
>>
>> Il giorno ven 25 feb 2022 alle ore 09:03 BELAHCENE Abdelkader <
>> abdelkader.belahc...@enst.dz> ha scritto:
>>
>>> Hi,
>>> a lot of people think that C (or C++) is faster than python, yes I
>>> agree, but I think that's not the case with numpy, I believe numpy is
>>> faster than C, at least in some cases.
>>>
>>>
>>> *Is there another explanation ?Or where can find  a doc speaking  about
>>> the subject?*Thanks a lot
>>> Regards
>>> Numpy implements vectorization for arrays, or I'm wrong. Anyway here is
>>> an example Let's look at the following case:
>>> Here is the result on my laptop i3:
>>>
>>> Labs$ *python3 tempsExe.py  50000*
>>>   sum with Python: 1250025000 and NumPy 1250025000
>>>       time used Python Sum: * 37.28 sec *
>>>       time used  Numpy Sum:  *1.85 sec*
>>>
>>> Labs$ *./tt    50000 *
>>>
>>>
>>> *   CPU  time :7.521730    The value : 1250025000 *
>>> --------------------------------------------
>>>
>>> This is the Python3 program :
>>>
>>> import timeit as it
>>> import numpy as np
>>> import sys
>>> try :
>>> n=eval(sys.argv[1])
>>> except:
>>> print ("needs integer as argument") ; exit()
>>>
>>> a=range(1,n+1)
>>> b=np.array(a)
>>> def func1():     return sum(a)
>>> def func2(): return np.sum(b)
>>>
>>> print(f"sum with Python: {func1()} and NumPy {func2()} ")
>>> tm1=it.timeit(stmt=func1, number=n)
>>> print(f"time used Python Sum: {round(tm1,2)} sec")
>>> tm2=it.timeit(stmt=func2, number=n)
>>> print(f"time used  Numpy Sum: {round(tm2,2)} sec")
>>>
>>> and Here the C program:
>>> #include <time.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> long func1(int n){
>>>          long  r=0;
>>>         for (int  i=1; i<= n;i++) r+= i;
>>>          return r;
>>> }
>>> int main(int argc, char* argv[]){
>>>          clock_t c0, c1;
>>>         long v,count; int n;
>>>        if ( argc < 2) {
>>>               printf("Please give an argument");
>>>              return -1;
>>>       }
>>>     n=atoi(argv[1]);
>>>     c0 = clock();
>>>      *for (int j=0;j < n;j++) v=func1(n);*
>>>      c1 = clock();
>>>      printf ("\tCPU  time :%.2f sec", (float)(c1 - c0)/CLOCKS_PER_SEC);
>>>      printf("\n\tThe value : %ld\n",  v);
>>> }
>>> _______________________________________________
>>> python-uk mailing list
>>> python-uk@python.org
>>> https://mail.python.org/mailman/listinfo/python-uk
>>>
>>
>>
>> --
>> Life is a chess game - Anonymous.
>> _______________________________________________
>> python-uk mailing list
>> python-uk@python.org
>> https://mail.python.org/mailman/listinfo/python-uk
>>
> _______________________________________________
> python-uk mailing list
> python-uk@python.org
> https://mail.python.org/mailman/listinfo/python-uk
>

_______________________________________________
python-uk mailing list
python-uk@python.org
https://mail.python.org/mailman/listinfo/python-uk

Re: [python-uk] C is it faster than numpy

Reply via email to