Re: Speed-up for loops

2010-09-09 Thread BartC



"Stefan Behnel"  wrote in message
news:mailman.563.1283921317.29448.python-l...@python.org...

BartC, 08.09.2010 03:45:

Getting back to the OP's code again (trivial and pointless as it might
seem), I got these results:

C (gcc 3.4.5 -O3)   0.8 secs
C (DMC -o)          2.3 secs
C (lccwin32 -O)     2.9 secs



BTW, I wonder why the code takes a whole 0.8 seconds to run in your gcc
test. Maybe you should use a newer GCC version. It shouldn't take more
than a couple of milliseconds (for program startup, OS calls, etc.), given
that the output is a constant.


It turns out the 0.8 second version cheats. It counts to a billion, but does
not do the work asked (a=a+10).

The fastest version that actually does what was requested takes about 1.2
seconds when register-based, and about 3 seconds when memory-based.

This reduces the slowdown of Python 3 from 220x to 150x, and down to 60x
when register versions are not counted (as the use of registers can't be
guaranteed). (Python 2 was about 20% faster.)

--
Bartc



--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-08 Thread BartC

"Steven D'Aprano"  wrote in message
news:4c878be5$0$3$c3e8...@news.astraweb.com...

On Tue, 07 Sep 2010 11:00:03 +0100, BartC wrote:


for i in xrange(100000000):
    a = a + f(i)



With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
speed by 15%; 4.00 minutes reduces to 3.30 minutes.



I'm afraid that I can't replicate those figures. In my test, unrolling
the loop causes a massive SLOWDOWN of 37%, not a speed up. Here is my
test code:


You're absolutely right. I completely forgot about modulating the i index
for each duplicated line.


def test_unrolled(n):
    a = 0
    for i in range(n//4):
        a += f(4*i)
        a += f(4*i+1)
        a += f(4*i+2)
        a += f(4*i+3)
    return a


Although tidying up your calculations (as already posted) gives code that is
neither faster nor slower.

I'd hoped that using the following would help, but it did nothing in Python 3,
and gave only an 8-10% speedup in Python 2:

for i in xrange(0,n,4):
   a=a+f(i)
   a=a+f(i+1)
   a=a+f(i+2)
   a=a+f(i+3)

(On my other example of setting list elements to 0, this did improve speed
by some 10% in Python 3, and 28% in '2'.)

So using manual unrolling for an indexed loop is not going to be the right
approach (it's also fiddly, and there is the problem of N not being a 
multiple of 4 or whatever).
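
For completeness, a sketch of what handling a general N looks like, with a
4x-unrolled main loop plus a remainder loop (the function names here are
only illustrative, and f is the one from Steven's test):

def f(x):
    return x + 1

def test_unrolled_any_n(n):
    # Unroll 4x, then mop up the last n % 4 iterations separately --
    # exactly the fiddliness described above.
    a = 0
    stop = n - n % 4
    for i in xrange(0, stop, 4):
        a += f(i)
        a += f(i + 1)
        a += f(i + 2)
        a += f(i + 3)
    for i in xrange(stop, n):
        a += f(i)
    return a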


We're trying to avoid the iteration overhead, yet we're adding it in the
code anyway, and as user-level code. But, I still think that internal
support for such a thing might be worthwhile, when it can make certain
assumptions about the loop range and index type.

--
Bartc 


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-08 Thread Alain Ketterlin
Steven D'Aprano  writes:

>> With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
>> speed by 15%; 4.00 minutes reduces to 3.30 minutes.

> I'm afraid that I can't replicate those figures. In my test, unrolling 
> the loop causes a massive SLOWDOWN of 37%, not a speed up. Here is my 
> test code:
>
> def f(x):
>     return x+1
>
> def test_loop(n):
>     a = 0
>     for i in range(n):
>         a += f(i)
>     return a
>
> def test_unrolled(n):
>     a = 0
>     for i in range(n//4):
>         a += f(4*i)
>         a += f(4*i+1)
>         a += f(4*i+2)
>         a += f(4*i+3)
>     return a
>
> from timeit import Timer
> n = 10000000
> assert test_loop(n) == test_unrolled(n)
> t1 = Timer('test_loop(n)', 
>   'from __main__ import test_loop; n=%d' % n)
> t2 = Timer('test_unrolled(n)', 
>   'from __main__ import test_unrolled; n=%d' % n)
>
> And here are the results using Python 3.1. Times are the best of 5 for 10
> calls each to the test_* functions, results given in seconds:
>
> >>> min(t1.repeat(number=10, repeat=5))/10
> 5.9740923209
> >>> min(t2.repeat(number=10, repeat=5))/10
> 8.25656900405883

Running this test with python 2.6 (on my laptop) led to:

>>> min(t1.repeat(number=10, repeat=5))/10
2.10715539455
>>> min(t2.repeat(number=10, repeat=5))/10
2.43037149906

That's a 15% slowdown, which is reasonable since you add four multiplies
in the loop body. Changing your unrolled loop to:

def test_unrolled(n):
    a = 0
    for i in range(n//4):
        b = 4*i
        a += f(b)
        a += f(b+1)
        a += f(b+2)
        a += f(b+3)
    return a

makes both versions run in approximately the same time (2.135 vs.
2.136).

> That's slightly better (32% slower instead of 37% slower), but still a 
> massive performance hit. Given these results, I'm prepared to say that 
> loop unrolling in Python is almost certainly going to be a pessimation, 
> not an optimization, no matter what you have inside the loop.

I don't really see why it should be the case. Do you have any idea?

I don't think either that it should speed things up significantly. Loop
unrolling in binary code is relevant mostly because it allows better
instruction scheduling (i.e., scheduling has fewer constraints in
longer instruction sequences). Python programs are way too far from
binary code for scheduling opts to apply.

-- Alain.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-08 Thread Steven D'Aprano
On Tue, 07 Sep 2010 11:00:03 +0100, BartC wrote:

>> for i in xrange(100000000):
>>     a = a + f(i)
>>
>> then unrolling the loop is even less useful. The overhead of the loop
>> itself is likely to be trivial compared to the cost of calling f() 100
>> million times -- the added complexity to shave 3 seconds off a four
>> minute calculation simply isn't worth it.
> 
> With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
> speed by 15%; 4.00 minutes reduces to 3.30 minutes.


I'm afraid that I can't replicate those figures. In my test, unrolling 
the loop causes a massive SLOWDOWN of 37%, not a speed up. Here is my 
test code:


def f(x):
    return x+1

def test_loop(n):
    a = 0
    for i in range(n):
        a += f(i)
    return a

def test_unrolled(n):
    a = 0
    for i in range(n//4):
        a += f(4*i)
        a += f(4*i+1)
        a += f(4*i+2)
        a += f(4*i+3)
    return a


from timeit import Timer
n = 10000000
assert test_loop(n) == test_unrolled(n)
t1 = Timer('test_loop(n)', 
  'from __main__ import test_loop; n=%d' % n)
t2 = Timer('test_unrolled(n)', 
  'from __main__ import test_unrolled; n=%d' % n)

And here are the results using Python 3.1. Times are the best of 5 for 10
calls each to the test_* functions, results given in seconds:

>>> min(t1.repeat(number=10, repeat=5))/10
5.9740923209
>>> min(t2.repeat(number=10, repeat=5))/10
8.25656900405883


I repeated it with a larger loop variable. Since the time per test was so 
large (over ten minutes per call on my machine!), I didn't see the need 
to run multiple trials:

n *= 100
assert test_loop(n) == test_unrolled(n)
t3 = Timer('test_loop(n)', 
  'from __main__ import test_loop; n=%d' % n)
t4 = Timer('test_unrolled(n)', 
  'from __main__ import test_unrolled; n=%d' % n)

And the results:

>>> t3.timeit(number=1)
653.3572518825531
>>> t4.timeit(number=1)
864.6454808712006

That's slightly better (32% slower instead of 37% slower), but still a 
massive performance hit. Given these results, I'm prepared to say that 
loop unrolling in Python is almost certainly going to be a pessimation, 
not an optimization, no matter what you have inside the loop.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-08 Thread Ulrich Eckhardt
BartC wrote:
> So 'range' is just a class like any other. And that a class is something
> you can blithely copy from one variable to another. And whenever you see
> 'range' anywhere, you can't always be certain that someone hasn't done:
> 
> range = 42
> 
> at some point.

True. I read an explanation here that IMHO pretty well explains what's going
on: The "range" above is a label, attached with a piece of string to an
object. The object in this case is the integer 42. The same object can have
multiple labels attached to it, so "foo = bar" will just create a new
label "foo" and attach it to the object that the label "bar" is already
attached to.

Note that I said object, which includes class instances like the int 42
above, but also the classes themselves, functions (or bound functions[1])
and modules (I hope I didn't miss any).
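
A tiny sketch of that metaphor (the names here are purely illustrative):

range = 42             # 'range' is now just a label on the int 42
foo = range            # attach a second label to the same object
print(foo is range)    # True: one object, two labels
range = "something else"
print(foo)             # still 42; only the label moved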

> That explains a lot about the difficulties of implementing Python
> efficiently.

Yes, "range" is not a keyword or something intrinsic to the interpreter. It
is just a name that is looked up whenever it is encountered, and that costs
time. However, you also get some nice flexibility at that cost.

Cheers!

Uli

[1] bound function = class function where the instance parameter is bound.
Example:

  x = []
  a = x.append
  a(42) # calls x.append(42)

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread Stefan Behnel

BartC, 08.09.2010 03:45:

Getting back to the OP's code again (trivial and pointless as it might
seem), I got these results:

C (gcc 3.4.5 -O3)   0.8 secs
C (DMC -o)          2.3 secs
C (lccwin32 -O)     2.9 secs
[...]
I've seen LuaJIT in action. Its timing for this test is 1.5 secs:
forget being only 10x slower than C, it's faster than some C versions!
(I'm sure it must be cheating somewhere...)


Sure it does. C is statically compiled, while LuaJIT is a JIT compiler. It 
unjustly uses *runtime* information to compile the code. You can get a 
similar advantage with some C compilers by using profile based optimisation.


BTW, I wonder why the code takes a whole 0.8 seconds to run in your gcc 
test. Maybe you should use a newer GCC version. It shouldn't take more than 
a couple of milliseconds (for program startup, OS calls, etc.), given that 
the output is a constant.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread MRAB

On 08/09/2010 02:45, BartC wrote:



"David Cournapeau"  wrote in message
news:mailman.546.1283897932.29448.python-l...@python.org...

On Sun, Sep 5, 2010 at 8:28 PM, BartC  wrote:



One order of magnitude (say 10-20x slower) wouldn't be so bad. That's
what you might expect for a dynamically typed, interpreted language.


10/20x slower than C is only reached by extremely well optimized
dynamic languages. It would be a tremendous achievement. If that's


Well, that is what I do (mess around with languages and stuff).

Getting back to the OP's code again (trivial and pointless as it might
seem), I got these results:

C (gcc 3.4.5 -O3)     0.8 secs
C (DMC -o)            2.3 secs
C (lccwin32 -O)       2.9 secs
My last interpreter  12.6 secs  (dynamically typed language;
                     or 4.5 secs when told the type of 'a',
                     but that's cheating a little..)
Python 3            177.0 secs

That's why I was questioning the latter's performance in for-loops. But now
that I know a bit more about Python (having dynamic everything) the figure
is not so surprising. However, it's still slow!


what you are after, look at Lua with its JIT, or Scheme + Stalin.


I've seen LuaJIT in action. Its timing for this test is 1.5 secs:
forget being only 10x slower than C, it's faster than some C versions!
(I'm sure it must be cheating somewhere...)


If you'd like to direct your skills to making CPython faster, without
diminishing its flexibility, I'm sure it'll be welcomed. The source is
all public, you know! :-)
--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread BartC



"David Cournapeau"  wrote in message
news:mailman.546.1283897932.29448.python-l...@python.org...

On Sun, Sep 5, 2010 at 8:28 PM, BartC  wrote:



One order of magnitude (say 10-20x slower) wouldn't be so bad. That's
what you might expect for a dynamically typed, interpreted language.


10/20x slower than C is only reached by extremely well optimized
dynamic languages. It would be a tremendous achievement. If that's


Well, that is what I do (mess around with languages and stuff).

Getting back to the OP's code again (trivial and pointless as it might
seem), I got these results:

C (gcc 3.4.5 -O3)     0.8 secs
C (DMC -o)            2.3 secs
C (lccwin32 -O)       2.9 secs
My last interpreter  12.6 secs  (dynamically typed language;
                     or 4.5 secs when told the type of 'a',
                     but that's cheating a little..)
Python 3            177.0 secs

That's why I was questioning the latter's performance in for-loops. But now
that I know a bit more about Python (having dynamic everything) the figure
is not so surprising. However, it's still slow!


what you are after, look at Lua with its JIT, or Scheme + Stalin.


I've seen LuaJIT in action. Its timing for this test is 1.5 secs: forget
being only 10x slower than C, it's faster than some C versions! (I'm sure it
must be cheating somewhere...)


--
bartc 


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread Terry Reedy

On 9/7/2010 6:00 AM, BartC wrote:


Why should it? But if you want it, you can do it:

xrange = range

There, that wasn't hard, was it?


I think I just learned more about Python than from months of reading this
group.

So 'range' is just a class like any other. And that a class is something
you
can blithely copy from one variable to another.


There is no copying of the class object. A new name is associated with
the object. Any object can be given any number of names (aliases) that
all refer to one and the same object.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread David Cournapeau
On Sun, Sep 5, 2010 at 8:28 PM, BartC  wrote:

>
> One order of magnitude (say 10-20x slower) wouldn't be so bad. That's what
> you might expect for a dynamically typed, interpreted language.

10/20x slower than C is only reached by extremely well optimized
dynamic languages. It would be a tremendous achievement. If that's
what you are after, look at Lua with its JIT, or Scheme + Stalin.

For cases where vectorization is indeed not applicable (recursive
algorithms), like in some signal processing, there are specialized
tools which are very expressive while retaining good speed (faust is
an interesting one for audio signal processing).

> That would simply be delegating Python to a scripting language.

That's a strange thing to say if you compare it to matlab.

> It would be
> nice if you could directly code low-level algorithms in it without relying
> on accelerators

It would be nice, but the fact is that python cannot do it  - and is
quite far from being able to do it. I don't think it is as important
as you think it is, because things like numpy are extremely powerful
in many cases.

cheers,

David
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread Aahz
In article ,
Roy Smith   wrote:
>
>Imagine that you're looking at some code which was written years ago, by 
>people who are no longer around to answer questions.  In one place, you 
>see:
>
>for i in range(n):
>   blah
>
>and in another, you see:
>
>for j in xrange(n):
>   blah
>
>If you are truly a Python expert, you'll say to yourself, "range and 
>xrange are synonyms", and move on to other things.  If, however, you're 
>not really an expert, you'll look at this and say, "Hmmm, in one place 
>they used range(), and in another they used xrange().  Clearly, there's 
>some reason for the difference, I need to figure out what it is, because 
>that's probably key to my understanding why this code isn't working".  
>So, you spend the next two hours poring over reference manuals trying 
>to understand the subtle difference, until your friend comes over and 
>says, "You dolt, you just wasted half the afternoon.  They're the same 
>thing.  Move on, this is not the bug you're looking for".

...and if you're a Python guru, you might spend a little bit of time
trying to figure out if the range() is causing the problem due to
allocating a large chunk of memory.
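
A quick way to see the allocation in question (a sketch, assuming Python
2.6+ for sys.getsizeof):

import sys

print(sys.getsizeof(range(10**6)))    # a real list: several megabytes
print(sys.getsizeof(xrange(10**6)))   # a lazy object: a few dozen bytes
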
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box."  --Cliff Wells
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread Aahz
In article ,
BartC  wrote:
>"Steven D'Aprano"  wrote in message
>news:4c85adfe$0$5$c3e8...@news.astraweb.com...
>> 
>> xrange = range
>>
>> There, that wasn't hard, was it?
>
>I think I just learned more about Python than from months of reading this
>group.
>
>So 'range' is just a class like any other. And that a class is something you
>can blithely copy from one variable to another. And whenever you see 'range'
>anywhere, you can't always be certain that someone hasn't done:
>
>range = 42
>
>at some point. That explains a lot about the difficulties of implementing
>Python efficiently. (And the xrange=range trick works well thanks.)

Actually, range() is a function.  But the same point applies, squared --
you really can never know what kind of object is hiding behind a name in
the general case.
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box."  --Cliff Wells
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread Roy Smith
In article ,
 "BartC"  wrote:

> (BTW why doesn't Python 3 just accept 'xrange' as a 
> synonym for 'range'?)

If you've ever tried to maintain a legacy code base, you'll understand 
why "there's only one way to do it" is A Good Thing.

Imagine that you're looking at some code which was written years ago, by 
people who are no longer around to answer questions.  In one place, you 
see:

for i in range(n):
   blah

and in another, you see:

for j in xrange(n):
   blah

If you are truly a Python expert, you'll say to yourself, "range and 
xrange are synonyms", and move on to other things.  If, however, you're 
not really an expert, you'll look at this and say, "Hmmm, in one place 
they used range(), and in another they used xrange().  Clearly, there's 
some reason for the difference, I need to figure out what it is, because 
that's probably key to my understanding why this code isn't working".  
So, you spend the next two hours poring over reference manuals trying 
to understand the subtle difference, until your friend comes over and 
says, "You dolt, you just wasted half the afternoon.  They're the same 
thing.  Move on, this is not the bug you're looking for".

If you think stuff like that can't happen, you've probably never jumped 
into the middle of a legacy code base written in a language you don't 
know very well.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-07 Thread BartC

"Steven D'Aprano"  wrote in message
news:4c85adfe$0$5$c3e8...@news.astraweb.com...

On Mon, 06 Sep 2010 11:38:22 +0100, BartC wrote:




Manually unrolling such a loop four times (ie. 4 copies of the body, and
counting only to 25 million) increased the speed by between 16% and 47%
(ie. runtime reducing by between 14% and 32%).


Or you could *really* optimize it, as well as simplifying the code, by
writing:

n = 100000000
a = 10*n

and doing a single multiplication rather than pointless repeated addition.


It's not pointless; it's being used as a test of how long it takes to do
a=a+10 a hundred million times in a loop.

(Similarly, if we wanted to investigate the speed of function calls, then a
'pointless' recursive implementation of fibonacci(36) that needed 76 million
(or whatever) calls is far more useful than one that did it with just 1. But
a lot of people just don't get this, for example some of the comments here:

http://programmingzen.com/2007/11/28/holy-shmoly-ruby-19-smokes-python-away/
)


Of course, if 10 (or 100) is not a constant but is just standing in for
something which varies each iteration:

for i in xrange(100000000):
    a = a + f(i)

then unrolling the loop is even less useful. The overhead of the loop
itself is likely to be trivial compared to the cost of calling f() 100
million times -- the added complexity to shave 3 seconds off a four
minute calculation simply isn't worth it.


With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
speed by 15%; 4.00 minutes reduces to 3.30 minutes.

In this code (adapted from a real piece of Python code):

for i in xrange(100000000):
    a[i]=0

Unrolling 4x produced a 47% speedup. Now your program only takes 2:45
minutes.


(BTW why doesn't Python 3 just accept 'xrange' as a synonym for 'range'?)

Why should it? But if you want it, you can do it:

xrange = range

There, that wasn't hard, was it?


I think I just learned more about Python than from months of reading this
group.

So 'range' is just a class like any other. And that a class is something you
can blithely copy from one variable to another. And whenever you see 'range'
anywhere, you can't always be certain that someone hasn't done:

range = 42

at some point. That explains a lot about the difficulties of implementing
Python efficiently. (And the xrange=range trick works well thanks.)


(2) Integer arithmetic seems to go straight from 32-bits to long
integers; why not use 64-bits before needing long integers?


Why? That adds 50% more code, 50% more testing, 50% more places for bugs
to hide, 50% more effort required to maintain it, for something which *at
best* will be a trivial optimization which at best is of interest to a
small minority of Python coders.


I would have suggested just using 64-bits anyway (on 32-bit hardware) as,
currently, the slight extra overhead would be insignificant, provided it
doesn't result in double the memory when there are many integers.

As for being of minority interest, with the whole world rapidly becoming
64-bits, I would dispute that...

>> repeat 100000000:# for example
>>     a = a + 10


Because now both the parser and all Python coders need to care about one
more reserved word, all so that one Python program in ten thousand can
save 0.1 ms occasionally.


That's a minor detail. You can call it whatever you like.

--
Bartc



--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-06 Thread Steven D'Aprano
On Mon, 06 Sep 2010 11:38:22 +0100, BartC wrote:

> Modifying the OP's code a little:
> 
> a = 0
> for i in xrange(100000000):  # 100 million
>     a = a + 10  # add 10 or 100
> print a
> 
> Manually unrolling such a loop four times (ie. 4 copies of the body, and
> counting only to 25 million) increased the speed by between 16% and 47%
> (ie. runtime reducing by between 14% and 32%).

Or you could *really* optimize it, as well as simplifying the code, by 
writing:

n = 100000000
a = 10*n

and doing a single multiplication rather than pointless repeated addition.

(I assume that the number of loops is meant to be a variable rather than 
a constant, otherwise a = 1000000000 would be the correct optimization.)

Of course, if 10 (or 100) is not a constant but is just standing in for 
something which varies each iteration:

for i in xrange(100000000):
    a = a + f(i)


then unrolling the loop is even less useful. The overhead of the loop 
itself is likely to be trivial compared to the cost of calling f() 100 
million times -- the added complexity to shave 3 seconds off a four 
minute calculation simply isn't worth it.

Besides, loop unrolling really only is valuable for small loops, 
otherwise the overhead caused by the increased code size makes it a 
pessimation rather than an optimization. It's very easy for any gains to 
be lost due to increased cache misses, time needed to copy the code into 
memory, etc.

http://en.wikipedia.org/wiki/Loop_unwinding

There's a lot of subtlety in optimization, and what counts as an 
optimization for low-level operations, and what is an optimization for 
high-level languages like Python, are rarely the same.


 
> This depended on whether I added +10 or +100 (ie. whether long integers
> are needed), whether it was inside or outside a function, and whether I
> was running Python 2 or 3 (BTW why doesn't Python 3 just accept 'xrange'
> as a synonym for 'range'?)

Why should it? But if you want it, you can do it:

xrange = range

There, that wasn't hard, was it?


> These are just some simple tests on my particular machine and
> implementations, but they bring up some points:
> 
> (1) Loop unrolling does seem to have a benefit, when the loop body is
> small.
> 
> (2) Integer arithmetic seems to go straight from 32-bits to long
> integers; why not use 64-bits before needing long integers?

Why? That adds 50% more code, 50% more testing, 50% more places for bugs 
to hide, 50% more effort required to maintain it, for something which *at 
best* will be a trivial optimization which at best is of interest to a 
small minority of Python coders.

There's already code to deal with 32 bit ints, code to deal with longints, 
and code to deal with shifting transparently from one to the other. 
Adding code to deal with 64 bit ints doesn't happen for free.

Besides, if you care about the difference between 32 and 64 bit ints, 
chances are you don't want Python blithely swapping from one to the other 
when you least expect it. So you'll be using a library that gives you 
access to whichever ints you want, probably implemented natively rather 
than as objects.

Sounds rather like numpy really :)

http://docs.scipy.org/doc/numpy/user/basics.types.html
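
For instance, a sketch of the explicit control numpy gives you over
fixed-width ints, as opposed to Python's transparent promotion:

import numpy as np

a = np.int32(2**31 - 1)
print(a + np.int32(1))   # wraps to -2147483648 (numpy may also warn)
print(2**31 - 1 + 1)     # 2147483648: Python promotes instead of wrapping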



> (3) Since the loop variable is never used, why not have a special loop
> statement that repeats code so many times? This can be very fast, since
> the loop counter need not be a Python object, and probably there would
> be no need for unrolling at all:
> 
> repeat 100000000:# for example
>     a = a + 10

Because now both the parser and all Python coders need to care about one 
more reserved word, all so that one Python program in ten thousand can 
save 0.1 ms occasionally.

Optimizations don't happen for free. If the optimization doesn't carry 
its own weight, it's a pessimation. To me, it's more important to be 
able to use repeat as a name:

connect_to_server("localhost", repeat=10)

than to save a millisecond or so.


Besides, if repeat were to become syntax, then I'd MUCH rather have it 
used for repeat...until (or repeat...while) loops, to avoid the anti-
pattern of:


x = f()
while not condition(x):
    x = f(x)


which would be better as:


repeat:
    x = f()
until condition(x)
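
Since Python has no such statement, the usual spelling is a while loop
with a break (f and condition are placeholders, as above):

while True:
    x = f()
    if condition(x):
        break

which at least calls f() in only one place.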





-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-06 Thread Terry Reedy

On 9/6/2010 7:20 AM, Stefan Behnel wrote:

BartC, 06.09.2010 12:38:



(3) Since the loop variable is never used, why not have a special loop
statement that repeats code so many times?


There sort-of is, just slightly more general.


Because special cases are not special enough to break the rules. As
others have stated before, you can use itertools to reduce that part of
the looping overhead, if it really hurts your concrete code.


I ran the following test:
>>> from itertools import repeat
>>> n = 10**8
>>> for dummy in repeat(None,n): pass # 7 sec

>>> for dummy in range(n): pass # 11 sec

Times are for my older machine, subjectively counted. The difference is 
subjectively clear. Both functions count in C. I presume the difference 
is the cost of creating and deleting unneeded Python int objects. This 
difference is the reason the timeit module uses itertools.repeat for the 
inner loop that repeats the code to be timed.
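
A timeit version of the same comparison (a sketch; the times above were
counted subjectively, these would be measured):

from timeit import Timer

# itertools.repeat yields the same object each time, so the loop
# creates and destroys no int objects.
t_repeat = Timer("for dummy in repeat(None, n): pass",
                 "from itertools import repeat; n = 10**7")
t_range = Timer("for dummy in range(n): pass", "n = 10**7")
print(min(t_repeat.repeat(repeat=3, number=1)))
print(min(t_range.repeat(repeat=3, number=1)))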


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-06 Thread BartC
"Stefan Behnel"  wrote in message 
news:mailman.485.1283772019.29448.python-l...@python.org...

BartC, 06.09.2010 12:38:



(2) Integer arithmetic seems to go straight from 32-bits to long
integers; why not use 64-bits before needing long integers?


You are making assumptions based on Python 2, I guess. Try Python 3.1 or 
later instead, where the int and long types are unified. Also, the 
implementation is a bit more complex than you appear to be thinking. Don't 
forget that it has received serious benchmarking based optimisations.


That's true; wider arithmetic was less of an overhead in Python 3.

This shows the effect of making several small optimisations which might 
otherwise be dismissed: with the +100 test, Python 3's faster wide 
arithmetic *plus* the 4x loop unrolling resulted in an 85% speed increase 
compared with Python 2 using the original loop, which is pretty good 
considering Python 3 is generally slower than '2'.



This can be very fast, since
the loop counter need not be a Python object


It still has to count, though.


That might be just a couple of machine instructions. Plus the bytecode 
overhead.


--
Bartc 


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-06 Thread Antoine Pitrou
On Mon, 06 Sep 2010 13:20:01 +0200
Stefan Behnel  wrote:
> 
> > (2) Integer arithmetic seems to go straight from 32-bits to long
> > integers; why not use 64-bits before needing long integers?
> 
> You are making assumptions based on Python 2, I guess. Try Python 3.1 or 
> later instead, where the int and long types are unified. Also, the 
> implementation is a bit more complex than you appear to be thinking. Don't 
> forget that it has received serious benchmarking based optimisations.

The optimizations are mainly related to big integer arithmetic,
though. For small ints the main cost is interpretation and
unboxing overhead, as always.

Regards

Antoine.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-06 Thread Kushal Kumaran
On Mon, Sep 6, 2010 at 4:08 PM, BartC  wrote:
> "Stefan Behnel"  wrote in message
> news:mailman.470.1283712666.29448.python-l...@python.org...
>>
>> BartC, 05.09.2010 19:09:
>
>>> All those compilers that offer loop unrolling are therefore wasting
>>> their time...
>>
>> Sometimes they do, yes.
>
> Modifying the OP's code a little:
>
> a = 0
> for i in xrange(100000000):      # 100 million
>     a = a + 10                  # add 10 or 100
> print a
>
> Manually unrolling such a loop four times (ie. 4 copies of the body, and
> counting only to 25 million) increased the speed by between 16% and 47% (ie.
> runtime reducing by between 14% and 32%).
>
> This depended on whether I added +10 or +100 (ie. whether long integers are
> needed), whether it was inside or outside a function, and whether I was
> running Python 2 or 3 (BTW why doesn't Python 3 just accept 'xrange' as a
> synonym for 'range'?)
>
> These are just some simple tests on my particular machine and
> implementations, but they bring up some points:
>
> (1) Loop unrolling does seem to have a benefit, when the loop body is small.
>
> (2) Integer arithmetic seems to go straight from 32-bits to long integers;
> why not use 64-bits before needing long integers?
>

On 64-bit systems, integer arithmetic will go from 64-bit native
integers to long.  Will using any emulated 64-bit type on a 32-bit
system actually be better than the python long implementation?

From my 64-bit linux system:

In [1]: n = 2 ** 40

In [2]: type(n)
Out[2]: <type 'int'>

In [3]: n = 2 ** 80

In [4]: type(n)
Out[4]: <type 'long'>


> (3) Since the loop variable is never used, why not have a special loop
> statement that repeats code so many times? This can be very fast, since the
> loop counter need not be a Python object, and probably there would be no
> need for unrolling at all:
>
> repeat 100000000:        # for example
>     a = a + 10
>

-- 
regards,
kushal
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-06 Thread Stefan Behnel

BartC, 06.09.2010 12:38:

why doesn't Python 3 just accept 'xrange' as a synonym for 'range'?


Because Python 3 deliberately breaks backwards compatibility in order to 
clean up the language.




These are just some simple tests on my particular machine and
implementations, but they bring up some points:

(1) Loop unrolling does seem to have a benefit, when the loop body is
small.


Nobody said otherwise. However, "small" is a pretty weak characterisation 
here. And don't expect CPython to do it for you, because doing it 
automatically requires the ability to see that there are no side effects. 
It's a lot more costly to assure that than what you actually gain with the 
optimisation. As your example shows, doing this optimisation manually is 
pretty straightforward, so there isn't really a need to let the runtime do 
it for you.




(2) Integer arithmetic seems to go straight from 32-bits to long
integers; why not use 64-bits before needing long integers?


You are making assumptions based on Python 2, I guess. Try Python 3.1 or 
later instead, where the int and long types are unified. Also, the 
implementation is a bit more complex than you appear to be thinking. Don't 
forget that it has received serious benchmarking based optimisations.




(3) Since the loop variable is never used, why not have a special loop
statement that repeats code so many times?


Because special cases are not special enough to break the rules. As others 
have stated before, you can use itertools to reduce that part of the 
looping overhead, if it really hurts your concrete code. There also were 
lots of examples in this thread on different looping solutions and their 
performance differences. Any of them may fit your needs in a given situation.




This can be very fast, since
the loop counter need not be a Python object


It still has to count, though.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-06 Thread BartC
"Stefan Behnel"  wrote in message 
news:mailman.470.1283712666.29448.python-l...@python.org...

BartC, 05.09.2010 19:09:



All those compilers that offer loop unrolling are therefore wasting
their time...


Sometimes they do, yes.


Modifying the OP's code a little:

a = 0
for i in xrange(100000000):  # 100 million
    a = a + 10  # add 10 or 100
print a

Manually unrolling such a loop four times (ie. 4 copies of the body, and 
counting only to 25 million) increased the speed by between 16% and 47% (ie. 
runtime reducing by between 14% and 32%).


This depended on whether I added +10 or +100 (ie. whether long integers are 
needed), whether it was inside or outside a function, and whether I was 
running Python 2 or 3 (BTW why doesn't Python 3 just accept 'xrange' as a 
synonym for 'range'?)


These are just some simple tests on my particular machine and 
implementations, but they bring up some points:


(1) Loop unrolling does seem to have a benefit, when the loop body is small.

(2) Integer arithmetic seems to go straight from 32-bits to long integers; 
why not use 64-bits before needing long integers?


(3) Since the loop variable is never used, why not have a special loop 
statement that repeats code so many times? This can be very fast, since the 
loop counter need not be a Python object, and probably there would be no 
need for unrolling at all:


repeat 100000000:# for example
    a = a + 10

--
Bartc 


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-05 Thread Stefan Behnel

BartC, 05.09.2010 19:09:

I've thought about it (writing an independent interpreter). But I don't
know enough of the language, and a lot of it I don't understand (eg.
OOP). Besides, I think the language itself deliberately makes it
difficult to get it up to speed. Some of the reasons might be the ones
set down here, in Chapter 2:

http://codespeak.net/svn/user/antocuni/phd/thesis/thesis.pdf


Second sentence of the abstract:

"""
However, all the existing implementations
prevent programmers from developing very efficient code.
"""

For an incorrect statement, that's a pretty bold one, even in the context 
given by the abstract. The author is lucky not to be an English native 
speaker, which explains some of the offensive wording in the intro.


Also, I'm impressed by an accepted Ph.D. thesis that comes along with just 
a bit more than four pages of references, and that fails to reference 
basically everything that the Python ecosystem provides for fast 
computation. I wouldn't mind a "faster Python", but as long as Python 
continues to be a language that allows you to very quickly get to the point 
where you can benchmark and optimise your code, and as long as CPython then 
makes it easy to do that, I won't be the one jumping up and down for a 
small factor of "general" improvement, especially when it comes at the 
price of switching to a completely different runtime. After all, it's 
really hard to appreciate that a program can now wait 5% more often on I/O 
than before. Larger performance improvements are usually due to algorithmic 
changes (including data structure adaptations and caching), rarely due to 
changes in the interpreter, especially when it's as old and well optimised 
as CPython.


I think the CPython interpreter does a really good job in providing a 
stable and predictable runtime environment and executing code in it, and 
the CPython ecosystem does a really good job in providing tools that make 
code run fast that needs to, be it due to efficient usage of I/O, CPU, 
memory, or whatever.


I'm not trying to put down the achievements of the author of the cited 
thesis, not at all. I'm sure the results are interesting for some people 
and for some kinds of applications, just like the different Python 
implementations are interesting for some people and some applications. But 
an awful lot of existing apps won't benefit at all from a fast CLI based 
Python implementation, simply because it doesn't have access (or at least 
no efficient access) to many of the existing tools in the CPython ecosystem.




The simple fact is that there are far more important things for Python
developers to spend their time and effort on than optimizations like
this.


I don't know, there seems to be an awful lot of effort going into
addressing exactly this issue (eg. PyPy, Psyco, Unladen Swallow,
Shedskin)


Those are very different efforts that address very different issues.



All those compilers that offer loop unrolling are therefore wasting
their time...


Sometimes they do, yes.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-05 Thread BartC
"Steven D'Aprano"  wrote in message 
news:4c83b425$0$28657$c3e8...@news.astraweb.com...

On Sun, 05 Sep 2010 12:28:47 +0100, BartC wrote:



It would
be nice if you could directly code low-level algorithms in it without
relying on accelerators, and not have to wait two and a half minutes (or
whatever) for a simple test to complete.


Yes, and it would be nice if my clothes washed and ironed themselves, but
they don't.

Somebody has to do the work. Are you volunteering to write the JIT
compiler for CPython? Will you contribute to the PyPy project, or help
maintain Psyco, or are you just bitching?


I've thought about it (writing an independent interpreter). But I don't know 
enough of the language, and a lot of it I don't understand (eg. OOP). 
Besides, I think the language itself deliberately makes it difficult to get 
it up to speed. Some of the reasons might be the ones set down here, in 
Chapter 2:


http://codespeak.net/svn/user/antocuni/phd/thesis/thesis.pdf

(Instead, I've confined my efforts to my own projects; the last interpreter 
I worked on did in fact get within a magnitude of C in performance, without 
using JIT or other fancy tricks.)



The simple fact is that there are far more important things for Python
developers to spend their time and effort on than optimizations like this.


I don't know, there seems to be an awful lot of effort going into 
addressing exactly this issue (eg. PyPy, Psyco, Unladen Swallow, 
Shedskin)



If such an optimization comes out of the PyPy project, I'll be cheering
them on -- but it's a lot of effort for such a trivial gain.

The example given by the Original Poster is contrived. Nobody sensibly
writes an integer multiplication as a repeated addition like that, and
any real code that would need to be put in a for loop is almost certainly
going to be too complicated for the JIT compiler to benefit greatly. The
loop overhead itself will almost certainly be overwhelmed by the work
done in the loop:


OK, you're saying there's no point in reducing the loop overhead, because 
the loop body is going to be even slower.


All those compilers that offer loop unrolling are therefore wasting their 
time...


--
Bartc 


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-05 Thread Stefan Behnel

Steven D'Aprano, 05.09.2010 17:00:

Of course, a real optimizing compiler would realise that the Pascal code
did nothing at all, and compile it all away to an empty a.out file...


Which is just one of the reasons why this kind of "benchmark" provides no 
insight into anything that should have an impact on a programmer's daily job.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-05 Thread Steven D'Aprano
On Sun, 05 Sep 2010 12:28:47 +0100, BartC wrote:

>> Getting the above kind of code fast requires the interpreter to be
>> clever enough so that it will use native machine operations on a int
>> type instead of converting back and forth between internal
>> representations.
> 
> Writing for i in xrange(1000000000) you'd think would give it a clue,
> but it makes no difference.

CPython takes a very conservative view towards runtime optimizations. 
Optimizations don't happen for free, you know, they have costs. Memory is 
one cost, but human time and effort is another.

But if you want a JIT compiler, see Psyco, or try PyPy, which is very 
exciting and I hope will one day be ready to take over from CPython as 
the first choice for production use.

[...]
> One order of magnitude (say 10-20x slower) wouldn't be so bad. That's
> what you might expect for a dynamically typed, interpreted language.
> 
> But on my machine this code was more like 50-200x slower than C, for
> unaccelerated Python.

I'd say that 50-200 times slower than C is exactly what I'd expect from a 
dynamically typed language like Python without any fancy JIT tricks. 
Getting such a language to within an order of magnitude of C is quite an 
achievement.


>> Generally, you use matlab's vectorized operations, and in that case,
>> numpy gives you similar performances (sometimes faster, sometimes
>> slower, but in the same ballpark in general).
> 
> That would simply be delegating Python to a scripting language. 

That's sheer unadulterated nonsense.

In any case, Python was designed as a glue language, specifically to be a 
high-level user-friendly language for gluing components written in C 
together. That's what Python *is* -- it provides a bunch of primitives, 
written in C (or Java, or dot-Net, pick whichever implementation you 
like) and manipulated in a friendly, safe language. Calling numpy for 
fast vectorized operations is *exactly* the right solution, if you need 
high-performance maths calculations.

Use the right tool for the job, don't insist that your spanner should 
double as a screwdriver.


> It would
> be nice if you could directly code low-level algorithms in it without
> relying on accelerators, and not have to wait two and a half minutes (or
> whatever) for a simple test to complete.

Yes, and it would be nice if my clothes washed and ironed themselves, but 
they don't.

Somebody has to do the work. Are you volunteering to write the JIT 
compiler for CPython? Will you contribute to the PyPy project, or help 
maintain Psyco, or are you just bitching?

The simple fact is that there are far more important things for Python 
developers to spend their time and effort on than optimizations like this.

If such an optimization comes out of the PyPy project, I'll be cheering 
them on -- but it's a lot of effort for such a trivial gain.

The example given by the Original Poster is contrived. Nobody sensibly 
writes an integer multiplication as a repeated addition like that, and 
any real code that would need to be put in a for loop is almost certainly 
going to be too complicated for the JIT compiler to benefit greatly. The 
loop overhead itself will almost certainly be overwhelmed by the work 
done in the loop:

[st...@sylar ~]$ time python -c "a = 0
> for i in xrange(10000000):
>     a += 10
> "

real0m6.906s
user0m5.820s
sys 0m0.022s


which is about double the time for an empty loop of the same size.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-05 Thread Steven D'Aprano
On Fri, 03 Sep 2010 21:17:44 +0100, BartC wrote:


> I'm not sure the Python developers were interested in getting fast
> loops.
> 
> For-loops which iterate between two numbers are amongst the easiest
> things to make fast in a language. Yet originally you had to use:
> 
>  for i in range(N):

I don't have any versions of Python prior to version 1.5, but as far back 
as that there was always a choice between creating a list with range() 
and a lazy iterator with xrange().
 
> which (if I understood correctly) actually created a list of N objects,
> populated it with the values 0, 1, 2...N-1 (presumably using a more
> sensible loop), then iterated between the values of the list!

By "more sensible", do you mean "in C code"? If so, then you are correct.


> So Python had the distinction of being one of the slowest languages in
> which to do nothing (ie. running an empty loop).

Nonsense.


[st...@sylar ~]$ time python test.py

real0m3.441s
user0m2.969s
sys 0m0.024s
[st...@sylar ~]$ time perl test.pl

real0m3.490s
user0m2.722s
sys 0m0.011s
[st...@sylar ~]$ time ruby test.rb

real0m11.875s
user0m6.740s
sys 0m3.995s


The difference between an empty loop in Python and Perl is insignificant, 
and much faster than Ruby (at least the specific version of Ruby 
installed on my machine, 1.8.6).

And if you want to see the code I ran:


[st...@sylar ~]$ cat test.*
# perl
for ($i = 0; $i < 10_000_000; ++$i) {
    1;
}
# python
for i in xrange(10000000):
    1
# ruby
for i in 0...10000000
    1
end


Just for comparisons' sake:

[st...@sylar ~]$ gpc empty_test.p
[st...@sylar ~]$ time ./a.out

real0m0.106s
user0m0.070s
sys 0m0.004s
[st...@sylar ~]$ cat empty_test.p
program main(input, output);
  var
i: integer;
begin
  for i := 0 to 1000 do
begin
end;
end.


Of course, a real optimizing compiler would realise that the Pascal code 
did nothing at all, and compile it all away to an empty a.out file...



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-05 Thread BartC

"David Cournapeau"  wrote in message
news:mailman.455.1283665528.29448.python-l...@python.org...

On Thu, Sep 2, 2010 at 7:02 PM, Michael Kreim 
wrote:



imax = 1000000000
a = 0
for i in xrange(imax):
    a = a + 10
print a



Unfortunately my Python Code was much slower [than Matlab] and I do not
understand why.


Getting the above kind of code fast requires the interpreter to be
clever enough so that it will use native machine operations on a int
type instead of converting back and forth between internal
representations.


Writing for i in xrange(1000000000) you'd think would give it a clue, but it
makes no difference.


Matlab since version 6 I believe, has a JIT to do
just that. There is no mature JIT-like implementation of python which
will give you the same speed up for this exact case today.



Or do I have to live with the fact that Matlab beats Python in this
example?


Yes. Without a JIT, python cannot hope to get the same kind of speeds
for this kind of examples.

That being said, neither Matlab nor Python is especially good at
doing what you do in your example - for this exact operation, doing it
in C or other compiled languages will be at least one order of
magnitude faster.


One order of magnitude (say 10-20x slower) wouldn't be so bad. That's what
you might expect for a dynamically typed, interpreted language.

But on my machine this code was more like 50-200x slower than C, for
unaccelerated Python.


Generally, you use matlab's vectorized operations,
and in that case, numpy gives you similar performances (sometimes
faster, sometimes slower, but in the same ballpark in general).


That would simply be delegating Python to a scripting language. It would be
nice if you could directly code low-level algorithms in it without relying
on accelerators, and not have to wait two and a half minutes (or whatever) 
for a simple test to complete.


--
bartc 


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-04 Thread David Cournapeau
On Thu, Sep 2, 2010 at 7:02 PM, Michael Kreim  wrote:
> Hi,
>
> I was comparing the speed of a simple loop program between Matlab and
> Python.
>
> My Codes:
> $ cat addition.py
> imax = 1000000000
> a = 0
> for i in xrange(imax):
>     a = a + 10
> print a
>
> $ cat addition.m
> imax = 1e9;
> a = 0;
> for i=0:imax-1
>    a = a + 10;
> end
> disp(a);
> exit;
>
> The results look like this:
> $ /usr/bin/time --verbose python addition.py
> 10000000000
>        Command being timed: "python addition.py"
>        User time (seconds): 107.30
>        System time (seconds): 0.08
>        Percent of CPU this job got: 97%
>        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:50.09
>        [...]
>
> $ /usr/bin/time --verbose matlab -nodesktop -nosplash -r "addition"
> [...]
>    1.0000e+10
>        Command being timed: "matlab -nodesktop -nosplash -r addition"
>        User time (seconds): 7.65
>        System time (seconds): 0.18
>        Percent of CPU this job got: 94%
>        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.25
>        [...]
>
> Unfortunately my Python Code was much slower and I do not understand why.

Getting the above kind of code fast requires the interpreter to be
clever enough so that it will use native machine operations on a int
type instead of converting back and forth between internal
representations.  Matlab since version 6 I believe, has a JIT to do
just that. There is no mature JIT-like implementation of python which
will give you the same speed up for this exact case today.

> Or do I have to live with the fact that Matlab beats Python in this example?

Yes. Without a JIT, python cannot hope to get the same kind of speeds
for this kind of examples.

That being said, neither Matlab nor Python is especially good at
doing what you do in your example - for this exact operation, doing it
in C or other compiled languages will be at least one order of
magnitude faster. Generally, you use matlab's vectorized operations,
and in that case, numpy gives you similar performances (sometimes
faster, sometimes slower, but in the same ballpark in general).

cheers,

David
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-04 Thread Gabriele Lanaro
Maybe for the simple sum you can just use the sum builtin:
python -m timeit -s  'sum((10,)*1)'
1000 loops, best of 3: 0.0985 usec per loop

About loops in general, it's good practice to use list comprehensions and
generator expressions.

2010/9/2 Michael Kreim 

> Hi,
>
> I was comparing the speed of a simple loop program between Matlab and
> Python.
>
> My Codes:
> $ cat addition.py
> imax = 1000000000
> a = 0
> for i in xrange(imax):
>     a = a + 10
> print a
>
> $ cat addition.m
> imax = 1e9;
> a = 0;
> for i=0:imax-1
>a = a + 10;
> end
> disp(a);
> exit;
>
> The results look like this:
> $ /usr/bin/time --verbose python addition.py
> 10000000000
>Command being timed: "python addition.py"
>User time (seconds): 107.30
>System time (seconds): 0.08
>Percent of CPU this job got: 97%
>Elapsed (wall clock) time (h:mm:ss or m:ss): 1:50.09
>[...]
>
> $ /usr/bin/time --verbose matlab -nodesktop -nosplash -r "addition"
> [...]
>    1.0000e+10
>Command being timed: "matlab -nodesktop -nosplash -r addition"
>User time (seconds): 7.65
>System time (seconds): 0.18
>Percent of CPU this job got: 94%
>Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.25
>[...]
>
> Unfortunately my Python Code was much slower and I do not understand why.
>
> Are there any ways to speed up the for/xrange loop?
> Or do I have to live with the fact that Matlab beats Python in this
> example?
>
> Thanks a lot for your answers.
>
> With best regards,
>
> Michael
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread Stefan Behnel

BartC, 03.09.2010 22:17:

for i in range(N):

which (if I understood correctly) actually created a list of N objects,
populated it with the values 0, 1, 2...N-1 (presumably using a more
sensible loop), then iterated between the values of the list!


I guess what applies here is "special cases aren't special enough to break 
the rules". The performance is good enough in most cases, it only hurts 
when the range is large and the loop body is small in comparison, such as 
in the most obvious stupid benchmarks.


Also, xrange() is a pretty old addition to the language, and it now replaces 
range() in Python 3.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread BartC
"Michael Kreim"  wrote in message 
news:mailman.362.1283422325.29448.python-l...@python.org...


I was comparing the speed of a simple loop program between Matlab and 
Python.


My Codes:
$ cat addition.py
imax = 1000000000
a = 0
for i in xrange(imax):
    a = a + 10
print a



Unfortunately my Python Code was much slower and I do not understand why.

Are there any ways to speed up the for/xrange loop?
Or do I have to live with the fact that Matlab beats Python in this 
example?


I'm not sure the Python developers were interested in getting fast loops.

For-loops which iterate between two numbers are amongst the easiest things 
to make fast in a language. Yet originally you had to use:


for i in range(N):

which (if I understood correctly) actually created a list of N objects, 
populated it with the values 0, 1, 2...N-1 (presumably using a more sensible 
loop), then iterated between the values of the list!


So Python had the distinction of being one of the slowest languages in which 
to do nothing (ie. running an empty loop).


--
Bartc 


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread Nobody
On Fri, 03 Sep 2010 11:21:36 +0200, Michael Kreim wrote:

> An anonymous Nobody suggested to use Numpy. I did not do this, because I 
> am very very new to Numpy and I did not figure out a Numpy specific way 
> to do this. Maybe a Numpy expert has something for me?

The problem with giving examples is that your original example is too
contrived. Taken literally, it can be optimised to

print imax * 10

A less contrived example would actually do something within the loop, in
order to justify the existence of the loop.

NumPy provides predefined loops which correspond to map, reduce,
accumulate and zip, for all of the standard arithmetic operators and
common mathematical functions. E.g. if your loop was:

a = 0
for i in xrange(imax):
    a += i**2
print a

the NumPy version would be:

print numpy.sum(numpy.arange(imax)**2)

The arange() function is similar to range() but generates an array. The **
operator is implemented for arrays; sum() sums the elements of the array.

The main downside is that the intermediate array(s) must be constructed in
memory, which rules out its use for very long sequences.
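
If memory is the concern, one workaround (a sketch; the chunk size and
function name are arbitrary) is to walk the range in fixed-size blocks so
only one intermediate array exists at a time:

import numpy as np

def chunked_sum(imax, chunk=10**6):
    # Sum 0..imax-1 without materialising the whole arange at once.
    total = 0
    for start in xrange(0, imax, chunk):
        block = np.arange(start, min(start + chunk, imax), dtype=np.int64)
        total += int(block.sum())   # accumulate as a Python int
    return total

print(chunked_sum(10**8))   # 4999999950000000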

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread Tim Wintle
On Fri, 2010-09-03 at 08:52 +0200, Ulrich Eckhardt wrote:
> Tim Wintle wrote:
> > [..] under the hood, cpython does something like this (in pseudo-code)
> > 
> > iterator = xrange(imax)
> > while 1:
> >     next_attribute = iterator.next
> >     try:
> >         i = next_attribute()
> >     except:
> >         break
> >     a = a + 10
> 
> There is one thing that strikes me here: The code claims that each iteration
> there is a lookup of the 'next' field in the iterator. I would expect that
> this is looked up once before the loop only.
> 
> Can you confirm that or am I misinterpreting your intention here?

As Stefan and Hrvoje have posted, there is a lookup - but in 2.4 and
above it's straight off the C structure and compiled efficiently.

(I've been looking at 2.3's source recently and had forgotten the
optimisation)

Tim



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread Hrvoje Niksic
Ulrich Eckhardt  writes:

> Tim Wintle wrote:
>> [..] under the hood, cpython does something like this (in pseudo-code)
>> 
>> iterator = xrange(imax)
>> while 1:
>>     next_attribute = iterator.next
>>     try:
>>         i = next_attribute()
>>     except:
>>         break
>>     a = a + 10
>
> There is one thing that strikes me here: The code claims that each
> iteration there is a lookup of the 'next' field in the iterator. I
> would expect that this is looked up once before the loop only.

It is looked up every time, but the lookup is efficient because "next"
is one of the special methods that have a slot in the C struct that
defines a Python type.  A closer rendition would be something like:

next_function = iterator->ob_type->tp_iternext;

...which is as fast as it gets.  CPython implements this in
Python/ceval.c, just look for FOR_ITER.
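
You can even watch the slot lookup from pure Python: for new-style classes,
special method lookup goes through the type, so an instance attribute named
"next" is ignored by the for loop (a quick demo, on CPython 2.x at least):

class It(object):
    def __iter__(self):
        return self
    def next(self):
        return "from the class"

it = It()
it.next = lambda: "from the instance"   # shadow "next" on the instance

print it.next()   # explicit attribute access finds the instance lambda
for x in it:      # FOR_ITER goes through ob_type, so the class method wins
    print x
    break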
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread python
Michael,

Thanks for summarizing and sharing your results. Very interesting.

Regards,
Malcolm
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread Stefan Behnel

Michael Kreim, 03.09.2010 11:21:

So finally I followed the recommendation of Tim Wintle to use cython. I
did not know this before, but I figured out the following:
additionWintle2.pyx:


def addition():
    cdef long imax = 10**9
    cdef long a = 0
    cdef long i
    for i in xrange(imax):
        a = a + 10
    print a

=> runs (wall clock time): 0:00.04


Note that this isn't the "real" runtime. If you look at the binary code 
that the C compiler spits out, you'll most likely find the final result for 
"a" written down as a literal that gets returned from the function. C 
compilers do these things to benchmarks these days.
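
(Easy to check against the standalone C++ version posted elsewhere in the
thread, assuming gcc is around:

$ g++ -O2 -S addition.cpp -o - | less

The loop is typically gone from the generated assembly; the final total
shows up as a single immediate constant.)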


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread Michael Kreim

Hi,

thanks a lot for your answers. I learned a lot and would like to sum up your 
suggestions and show you the results of the time command on my machine:


Original code by me:
imax = 10**9
a = 0
for i in xrange(imax):
    a = a + 10
print a
=> runs (wall clock time): 1:55.14

Peter Otten suggested to put the code into a function:
def f():
    imax = 10**9
    a = 0
    for i in xrange(imax):
        a = a + 10
    print a
f()
=> runs (wall clock time):  0:47.69

Tim Wintle and Philip Bloom posted some code using a while loop:
imax = 10**9
a = 0
i = 0
while 1:
    i = i + 1
    if (i > imax):
        break
    a = a + 10
print a
=> runs (wall clock time):  3:28.05

imax = 10**9
a = 0
i = 0
while (i < imax):
    a = a + 10
    i = i + 1
print a
=> runs (wall clock time):  3:27.74


Hrvoje Niksic suggested the usage of itertools:
from itertools import repeat
imax = 10**9
a = 0
for i in repeat(None, imax):
    a = a + 10
print a
=> runs (wall clock time): 1:58.25


I wrote a code combining these:
def f():
    from itertools import repeat
    imax = 10**9
    a = 0
    for i in repeat(None, imax):
        a = a + 10
    print a
f()
=> runs (wall clock time): 0:43.08

Then Roland Koebler suggested psyco, but I am sitting on a 64-bit machine 
and so I could not test it (although it looks promising).


An anonymous Nobody suggested using Numpy. I did not do this, because I 
am very very new to Numpy and I did not figure out a Numpy-specific way 
to do this. Maybe a Numpy expert has something for me?


So finally I followed the recommendation of Tim Wintle to use cython. I 
did not know this before, but I figured out the following:

additionWintle2.pyx:
def addition():
    cdef long imax = 10**9
    cdef long a = 0
    cdef long i
    for i in xrange(imax):
        a = a + 10
    print a

setup.py:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [Extension("additionWintle2", ["additionWintle2.pyx"])]
setup(
  name = 'Cython test',
  cmdclass = {'build_ext': build_ext},
  ext_modules = ext_modules
)

$ python setup.py build_ext --inplace
running build_ext

run.py:
from additionWintle2 import addition
addition()

=> runs (wall clock time):  0:00.04


And for comparison I wrote something similar in Matlab and C++ 
(although some authors pointed out that it is not that easy to compare 
"for" loops in these three languages):

addition.cpp
#include <iostream>
using namespace std;

int main()
{
    long imax = 1e9;
    long a = 0;
    long i;
    for (i = 0; i < imax; i++)
    {
        a = a + 10;
    }
    cout << a << endl;
    return 0;
}
=> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.32

addition.m
imax = 1e9;
a = 0;
for i=0:imax-1
    a = a + 10;
end
disp(a);
exit;
=> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.39


With best regards,

Michael


--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-03 Thread Stefan Behnel

Ulrich Eckhardt, 03.09.2010 08:52:

Tim Wintle wrote:

[..] under the hood, cpython does something like this (in pseudo-code)

iterator = xrange(imax)
while 1:
    next_attribute = iterator.next
    try:
        i = next_attribute()
    except:
        break
    a = a + 10


There is one thing that strikes me here: The code claims that each iteration
there is a lookup of the 'next' field in the iterator. I would expect that
this is looked up once before the loop only.

Can you confirm that or am I misinterpreting your intention here?


It needs to do that. Nothing keeps you from redefining "next" in each call. 
That's even a well-known way to implement state machines.
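
A minimal sketch of that pattern, with a made-up class and explicit .next()
calls; note that it is the instance attribute that gets rebound, which a
lookup hoisted out of the loop would miss:

class Blinker(object):
    # each state handler rebinds self.next to the handler for the
    # following state before returning
    def __init__(self):
        self.next = self._on
    def _on(self):
        self.next = self._off
        return "on"
    def _off(self):
        self.next = self._on
        return "off"

b = Blinker()
print b.next(), b.next(), b.next()   # on off on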


However, as usual, the details are a bit different in CPython, which has a 
C level slot for the "next" method. So the lookup isn't as heavy as it looks.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Ulrich Eckhardt
Tim Wintle wrote:
> [..] under the hood, cpython does something like this (in pseudo-code)
> 
> iterator = xrange(imax)
> while 1:
>   next_attribute = iterator.next
>   try:
>     i = next_attribute()
>   except:
>     break
>   a = a + 10

There is one thing that strikes me here: The code claims that each iteration
there is a lookup of the 'next' field in the iterator. I would expect that
this is looked up once before the loop only.

Can you confirm that or am I misinterpreting your intention here?

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Terry Reedy

On 9/2/2010 8:55 AM, Tim Wintle wrote:

On Thu, 2010-09-02 at 12:02 +0200, Michael Kreim wrote:

Hi,

I was comparing the speed of a simple loop program between Matlab and
Python.



Unfortunately my Python code was much slower and I do not understand why.


The main reason is that, under the hood, cpython does something like
this (in pseudo-code)

iterator = xrange(imax)
while 1:
    next_attribute = iterator.next
    try:
        i = next_attribute()
    except:
        break
    a = a + 10

where C (and I'm assuming matlab) does this:

while 1:
    i = i + 1
    if (i > imax):
        break
    a = a + 10


Which is to say, 'for' in Python is not the same as 'for' in C/Matlab; 
the while-loop version is the closer equivalent, and that is what Michael 
should use in a fair comparison. Otherwise, apples and oranges.



--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Carl Banks
On Sep 2, 5:55 am, Tim Wintle  wrote:
> On Thu, 2010-09-02 at 12:02 +0200, Michael Kreim wrote:
> > Hi,
>
> > I was comparing the speed of a simple loop program between Matlab and
> > Python.
> > Unfortunately my Python code was much slower and I do not understand why.
>
> The main reason is that, under the hood, cpython does something like
> this (in pseudo-code)
>
> iterator = xrange(imax)
> while 1:
>   next_attribute = iterator.next
>   try:
>     i = next_attribute()
>   except:
>     break
>   a = a + 10
>
> where C (and I'm assuming matlab) does this:
>
> while 1:
>   i = i + 1
>   if (i > imax):
>     break
>   a = a + 10

Not really.  Someone already posted timings of the while-loop version
in Python and it's much slower than the for loop.  The iterator stuff
is a minor overhead.

The real reason is simple and boring: many languages optimize loops
like this, Python doesn't.

Matlab has a hundred paid engineers whose job is to optimize it, and
its focus is mathematics, so of course they're going to pull out every
stop to get simple loops like the above as fast as possible.


> And the function call in the python is fairly expensive on its own.
> Plus it has to do all the standard interpreter stuff for memory
> management and deciding when to give up the GIL etc.

Matlab has all that stuff too (its memory management is much, much
worse than Python's, in fact, but memory management usually doesn't
play into tight loop timings).


> > Are there any ways to speed up the for/xrange loop?
>
> leaving it in python, no. (well, "range" is faster in 2.x, but once you
> get some cache misses due to increased memory usage it's much slower)
>
> avoiding iteration by using list comprehensions can help a lot though as
> it avoids most of the function calls.

List comprehensions use iteration and don't avoid function calls
relative to equivalent for-loops.  I think the main reason they're a
little faster is they can use tighter bytecode.
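
Easy enough to measure; something like this keeps the f() calls identical in
both versions (run it yourself, the numbers vary by machine):

$ python -m timeit -s 'def f(x): return x+1' 'r = []' 'for i in xrange(100000): r.append(f(i))'
$ python -m timeit -s 'def f(x): return x+1' 'r = [f(i) for i in xrange(100000)]'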

> If you really need to optimise it then you can convert that module to
> cython by adding a cdef, and then compile it:
>
> cdef int i
> for i in xrange(imax):
>     a = a + 10
> print a
>
> or you can write it in C it'll run a lot faster.

numpy is terrific when you can use it, and I've found that it can do a
lot more than most people expect.  The hard part is figuring out how.

In particular, numpy will trounce Matlab's performance for large
amounts of data, because of the aforementioned memory management
problem.


Carl Banks
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Speed-up for loops

2010-09-02 Thread Peter Otten
Philip Bloom wrote:

> Uh.
>
> Try:
> Imax=10**9
> a=0
> i=0
> While(i<Imax):
>   a= a+10
>   i=i+1
> print a
 
> I suspect you will find it is way faster than using range or xrange for
> large numbers and map far more closely in the final result to what you
> are doing on matlab's side.  At least last I checked, xrange and range
> both involve iterating through an array, which is much slower in all
> cases than just doing an int vs int compare (which is what your matlab
> is doing).

How did you check? 

$ python -m timeit "for i in xrange(1000000): pass"
10 loops, best of 3: 47.5 msec per loop
$ python -m timeit "i = 0" "while i < 1000000: i += 1"
10 loops, best of 3: 152 msec per loop
$

So an empty while loop takes about three times as long as the equivalent for 
loop. Also:

"""
class xrange(object)
 |  xrange([start,] stop[, step]) -> xrange object
 |
 |  Like range(), but instead of returning a list, returns an object that
 |  generates the numbers in the range on demand.  For looping, this is
 |  slightly faster than range() and more memory efficient.
"""

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: Speed-up for loops

2010-09-02 Thread Philip Bloom
Uh.  

Try:
Imax=10**9
a=0
i=0
While(i<Imax):
  a= a+10
  i=i+1
print a

I suspect you will find it is way faster than using range or xrange for
large numbers and map far more closely in the final result to what you
are doing on matlab's side.  At least last I checked, xrange and range
both involve iterating through an array, which is much slower in all
cases than just doing an int vs int compare (which is what your matlab
is doing).

-----Original Message-----
From: python-list-bounces+pbloom=crystald@python.org
[mailto:python-list-bounces+pbloom=crystald@python.org] On Behalf Of
Nobody
Sent: Thursday, September 02, 2010 9:05 AM
To: python-list@python.org
Subject: Re: Speed-up for loops

On Thu, 02 Sep 2010 12:02:40 +0200, Michael Kreim wrote:

> I was comparing the speed of a simple loop program between Matlab and 
> Python.

> imax = 10**9
> a = 0
> for i in xrange(imax):
>     a = a + 10
> print a

> Are there any ways to speed up the for/xrange loop?

Sure; the above can be reduced to just:

print imax * 10
;)

More seriously, if you're comparing against Matlab, you should look at
NumPy. If there's a reasonably direct approach using NumPy, it will be
much quicker than a Python "for" loop (in a sense, NumPy is a library of
useful "for" loops implemented in C).

Even a fairly indirect NumPy approach is often quicker than pure Python.

-- 
http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Tim Wintle
On Thu, 2010-09-02 at 16:13 +0200, Roland Koebler wrote:
> Hi,
> 
> > Are there any ways to speed up the for/xrange loop?
> You can use psyco.

Assuming you've got a 32-bit machine.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Nobody
On Thu, 02 Sep 2010 12:02:40 +0200, Michael Kreim wrote:

> I was comparing the speed of a simple loop program between Matlab and 
> Python.

> imax = 10**9
> a = 0
> for i in xrange(imax):
>     a = a + 10
> print a

> Are there any ways to speed up the for/xrange loop?

Sure; the above can be reduced to just:

print imax * 10
;)

More seriously, if you're comparing against Matlab, you should look at
NumPy. If there's a reasonably direct approach using NumPy, it will be
much quicker than a Python "for" loop (in a sense, NumPy is a library of
useful "for" loops implemented in C).

Even a fairly indirect NumPy approach is often quicker than pure Python.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Roland Koebler
Hi,

> Are there any ways to speed up the for/xrange loop?
You can use psyco.

The following example should be about 4-times as fast as your example:

import psyco
psyco.full()

def f():
    imax = 10**9
    a = 0
    for i in xrange(imax):
        a += 10
    print a
f()


regards,
Roland

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Hrvoje Niksic
Michael Kreim  writes:

> Are there any ways to speed up the for/xrange loop?
> Or do I have to live with the fact that Matlab beats Python in this
> example?

To a point, yes.  However, there are things you can do to make your
Python code go faster.  One has been pointed out by Peter.

Another is that Python treats numbers as regular heap objects, so
creating a bunch of unneeded integers by xrange slows things down
(despite allocation of integer objects being heavily optimized).  For
this reason, you may want to change xrange(10**9) to
itertools.repeat(None, 10**9).

$ python -m timeit -s 'from itertools import repeat' 'for _ in repeat(None, 100000): pass'
1000 loops, best of 3: 1.71 msec per loop
$ python -m timeit -s 'from itertools import repeat' 'for _ in xrange(100000): pass'
100 loops, best of 3: 2.43 msec per loop
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Stefan Behnel

Tim Wintle, 02.09.2010 14:55:

If you really need to optimise it then you can convert that module to
cython by adding a cdef, and then compile it:

cdef int i
for i in xrange(imax):
    a = a + 10
print a

or you can write it in C it'll run a lot faster.


Just to get the context right here: a C implementation won't run even a tad 
faster than the obvious Cython version, but both will run "a lot faster" 
than the Python version.


Plus, if Cython knows that the imax value is small enough, it'll infer 
"int" for the "i" variable automatically, so you won't need the "cdef" 
annotation. It won't automatically do that for "a", though, as that might 
break Python's unlimited integer semantics if "imax" and/or "a" are large 
enough.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Tim Wintle
On Thu, 2010-09-02 at 12:02 +0200, Michael Kreim wrote:
> Hi,
> 
> I was comparing the speed of a simple loop program between Matlab and 
> Python.

> Unfortunately my Python code was much slower and I do not understand why.

The main reason is that, under the hood, cpython does something like
this (in pseudo-code)

iterator = xrange(imax)
while 1:
  next_attribute = iterator.next
  try:
    i = next_attribute()
  except:
    break
  a = a + 10

where C (and I'm assuming matlab) does this:

while 1:
  i = i + 1
  if (i > imax):
break
  a = a + 10

And the function call in the python is fairly expensive on its own.
Plus it has to do all the standard interpreter stuff for memory
management and deciding when to give up the GIL etc.
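
(The call overhead is easy to put a number on:

$ python -m timeit -s 'def f(): pass' 'f()'
$ python -m timeit 'pass'

the first statement pays for nothing but the call machinery, the second is
nearly free.)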

> Are there any ways to speed up the for/xrange loop?

leaving it in python, no. (well, "range" is faster in 2.x, but once you
get some cache misses due to increased memory usage it's much slower)

avoiding iteration by using list comprehensions can help a lot though as
it avoids most of the function calls.

If you really need to optimise it then you can convert that module to
cython by adding a cdef, and then compile it:

cdef int i
for i in xrange(imax):
    a = a + 10
print a

or you can write it in C it'll run a lot faster.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Peter Otten
Michael Kreim wrote:

> Peter Otten wrote:
>> Move it into a function; this turns a and i into local variables.
>> 
>> def f():
>>     imax = 10**9
>>     a = 0
>>     for i in xrange(imax):
>>         a = a + 10
>>     print a
>> f()
> 
> Wow. It is still slower than Matlab, but your suggestion speeds up the
> code by ca. 50%.
> But I do not understand why the change of a global to a local variable
> makes such a big difference.

Basically the local namespace is a C array where accessing an item is just 
pointer arithmetic, while the global namespace is a Python dictionary. 

There may be optimisations for the latter. If you can read C, have a look at 
the LOAD/STORE_FAST and LOAD/STORE_GLOBAL implementations for the gory 
details:

http://svn.python.org/view/python/trunk/Python/ceval.c?view=markup
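
You can also see the difference without reading any C, via the dis module:

>>> import dis
>>> def f():
...     a = 0
...     a = a + 10    # local: compiles to LOAD_FAST / STORE_FAST
...
>>> def g():
...     global a
...     a = a + 10    # global: compiles to LOAD_GLOBAL / STORE_GLOBAL
...
>>> dis.dis(f)
>>> dis.dis(g)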

Peter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Michael Kreim

Peter Otten wrote:

Move it into a function; this turns a and i into local variables.

def f():
    imax = 10**9
    a = 0
    for i in xrange(imax):
        a = a + 10
    print a
f()


Wow. It is still slower than Matlab, but your suggestion speeds up the 
code by ca. 50%.
But I do not understand why the change of a global to a local variable 
makes such a big difference.



$ cat addition.py
imax = 10**9
a = 0
for i in xrange(imax):
    a = a + 10
print a

$ cat additionOtten.py
def f():
    imax = 10**9
    a = 0
    for i in xrange(imax):
        a = a + 10
    print a
f()

$ /usr/bin/time --verbose python addition.py
10000000000
Command being timed: "python addition.py"
User time (seconds): 110.52
System time (seconds): 0.01
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:52.76
[...]

$ /usr/bin/time --verbose python additionOtten.py
10000000000
Command being timed: "python additionOtten.py"
User time (seconds): 56.38
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:56.64
[...]
--
http://mail.python.org/mailman/listinfo/python-list


Re: Speed-up for loops

2010-09-02 Thread Peter Otten
Michael Kreim wrote:

> I was comparing the speed of a simple loop program between Matlab and
> Python.
> 
> My Codes:
> $ cat addition.py
> imax = 10**9
> a = 0
> for i in xrange(imax):
>     a = a + 10
> print a

> Are there any ways to speed up the for/xrange loop?

Move it into a function; this turns a and i into local variables.

def f():
    imax = 10**9
    a = 0
    for i in xrange(imax):
        a = a + 10
    print a
f()

> Or do I have to live with the fact that Matlab beats Python in this
> example?

I think so.
-- 
http://mail.python.org/mailman/listinfo/python-list


Speed-up for loops

2010-09-02 Thread Michael Kreim

Hi,

I was comparing the speed of a simple loop program between Matlab and 
Python.


My Codes:
$ cat addition.py
imax = 10**9
a = 0
for i in xrange(imax):
    a = a + 10
print a

$ cat addition.m
imax = 1e9;
a = 0;
for i=0:imax-1
    a = a + 10;
end
disp(a);
exit;

The results look like this:
$ /usr/bin/time --verbose python addition.py
10000000000
Command being timed: "python addition.py"
User time (seconds): 107.30
System time (seconds): 0.08
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:50.09
[...]

$ /usr/bin/time --verbose matlab -nodesktop -nosplash -r "addition"
[...]
1.0000e+10
Command being timed: "matlab -nodesktop -nosplash -r addition"
User time (seconds): 7.65
System time (seconds): 0.18
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.25
[...]

Unfortunately my Python code was much slower and I do not understand why.

Are there any ways to speed up the for/xrange loop?
Or do I have to live with the fact that Matlab beats Python in this example?

Thanks a lot for your answers.

With best regards,

Michael
--
http://mail.python.org/mailman/listinfo/python-list