Re: Numpy slow at vector cross product?

2016-11-22 Thread BartC

On 22/11/2016 16:48, Steve D'Aprano wrote:

On Tue, 22 Nov 2016 11:45 pm, BartC wrote:



I will have a look. Don't forget however that all someone is trying to
do is to multiply two vectors. They're not interested in axes
transformation or making them broadcastable, whatever that means.


You don't know that.


When I did vector arithmetic at school, the formula for the 
cross-product of (x1,y1,z1) with (x2,y2,z2) was quite straightforward. 
This is what you expect given a library function to do the job. 
Especially in an implementation that could be crippled by running as 
unoptimised byte-code.


If someone was given the task of cross-multiplying two three-element 
vectors, the formula in the text-book (or Wikipedia) is what they would 
write.
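For reference, that textbook formula as a minimal pure-Python sketch (the name `cross3` and the tuple return are my own choices for illustration, not numpy's API):

```python
def cross3(u, v):
    # Textbook cross product of (x1, y1, z1) x (x2, y2, z2)
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

print(cross3((1, 2, 3), (4, 5, 6)))  # (-3, 6, -3)
```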



Could numpy optimize the single pair of vectors case a bit better? Perhaps
they could. It looks like there's a bunch of minor improvements which could
be made to the code, a few micro-optimizations that shave a microsecond or
two off the execution time. And maybe they could even detect the case where
the arguments are a single pair of vectors, and optimize that. Even
replacing it with a naive pure-Python cross product would be a big win.


The code it ends up executing for individual pairs of vectors is a joke. 
97% of what it does is nothing to do with the cross-product!



But for the big array of vectors case, you absolutely have to support doing
fast vectorized cross-products over a huge number of vectors.


That's how I expected numpy to work. I said something along those lines 
in my first post:


BC:
> Maybe numpy has extra overheads, and the arrays being operated on are
> very small, but even so, 30 times slower than CPython? (2.5 to 0.083
> seconds.)



py> a = np.array([
... [1, 2, 3],
... [4, 5, 6],
... [7, 8, 9],
... ]*1000
... )
py> b = np.array([
... [9, 8, 7],
... [6, 5, 4],
... [3, 2, 1],
... ]*1000
... )
py> a.shape
(3000, 3)
py> result = np.cross(a, b)
py> result.shape
(3000, 3)

On my computer, numpy took only 10 times longer to cross-multiply 3000 pairs
of vectors than it took to cross-multiply a single pair of vectors. If I
did that in pure Python, it would take 3000 times longer, or more, so numpy
wins here by a factor of 300.


I tested this with 3 million pairs on my machine. It managed 8 million 
products per second (about the same as the simplest Python code, but run 
with pypy), taking nearly 0.4 seconds. Except it took over 4 seconds to 
create the special numpy arrays!


Setting up ordinary arrays was much faster, but then the cross-product 
calculation was slower.
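A rough way to separate the two costs described above (array construction versus the vectorised cross product) is to time them independently. This is only a sketch, with a much smaller `n` than the 3 million pairs above, and absolute numbers will vary by machine:

```python
import numpy as np
from timeit import default_timer

n = 300_000  # fewer pairs than the 3 million above, to keep the run short
pairs = [(1.0, 2.0, 3.0)] * n

t0 = default_timer()
a = np.array(pairs)            # cost of building the numpy array from Python data
t1 = default_timer()
c = np.cross(a, a[::-1])       # cost of one vectorised cross over all rows
t2 = default_timer()
print("setup: %.3fs  cross: %.3fs" % (t1 - t0, t2 - t1))
```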


--
bartc
--
https://mail.python.org/mailman/listinfo/python-list


Re: Numpy slow at vector cross product?

2016-11-22 Thread Steve D'Aprano
On Tue, 22 Nov 2016 11:45 pm, BartC wrote:


> I will have a look. Don't forget however that all someone is trying to
> do is to multiply two vectors. They're not interested in axes
> transformation or making them broadcastable, whatever that means.

You don't know that.

Bart, you have a rather disagreeable tendency towards assuming that if you
don't want to do something, then nobody else could possibly want to do it.
You should try to keep an open mind to the possibility that perhaps there
are use-cases that you didn't think of. numpy is not *the* most heavily
used third-party library in the Python ecosystem because it's slow.

numpy is a library for doing vectorised operations over massive arrays.
You're thinking of numpy users doing the cross product of two vectors, but
you should be thinking of numpy users doing the cross product of a million
pairs of vectors.

Could numpy optimize the single pair of vectors case a bit better? Perhaps
they could. It looks like there's a bunch of minor improvements which could
be made to the code, a few micro-optimizations that shave a microsecond or
two off the execution time. And maybe they could even detect the case where
the arguments are a single pair of vectors, and optimize that. Even
replacing it with a naive pure-Python cross product would be a big win.

But for the big array of vectors case, you absolutely have to support doing
fast vectorized cross-products over a huge number of vectors.

py> a = np.array([
... [1, 2, 3],
... [4, 5, 6],
... [7, 8, 9],
... ]*1000
... )
py> b = np.array([
... [9, 8, 7],
... [6, 5, 4],
... [3, 2, 1],
... ]*1000
... )
py> a.shape
(3000, 3)
py> result = np.cross(a, b)
py> result.shape
(3000, 3)


On my computer, numpy took only 10 times longer to cross-multiply 3000 pairs
of vectors than it took to cross-multiply a single pair of vectors. If I
did that in pure Python, it would take 3000 times longer, or more, so numpy
wins here by a factor of 300.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



Re: Numpy slow at vector cross product?

2016-11-22 Thread BartC

On 22/11/2016 12:45, BartC wrote:

On 22/11/2016 12:34, Skip Montanaro wrote:

I'm simply suggesting there is plenty of room for improvement. I even

showed a version that did *exactly* what numpy does (AFAIK) that was
three
times the speed of numpy even executed by CPython. So there is some
mystery
there.

As I indicated in my earlier response, your version doesn't pass all of
numpy's cross product unit tests. Fix that and submit a patch to the
numpy
maintainers. I suspect it would be accepted.


I saw your response but didn't understand it. My code was based around
what Peter Otten posted from numpy sources.

I will have a look. Don't forget however that all someone is trying to
do is to multiply two vectors. They're not interested in axes
transformation or making them broadcastable, whatever that means.


It seems the posted code of numpy.cross wasn't complete. The full code 
is below (I've removed the doc string, but have had to add some 'np.' 
prefixes as it is now out of context).


If anyone is still wondering why numpy.cross is slow, then no further 
comments are necessary!


--
def cross(a, b, axisa=-1, axisb=-1, axisc=-1, axis=None):
    if axis is not None:
        axisa, axisb, axisc = (axis,) * 3
    a = np.asarray(a)
    b = np.asarray(b)
    # Check axisa and axisb are within bounds
    axis_msg = "'axis{0}' out of bounds"
    if axisa < -a.ndim or axisa >= a.ndim:
        raise ValueError(axis_msg.format('a'))
    if axisb < -b.ndim or axisb >= b.ndim:
        raise ValueError(axis_msg.format('b'))
    # Move working axis to the end of the shape
    a = np.rollaxis(a, axisa, a.ndim)
    b = np.rollaxis(b, axisb, b.ndim)
    msg = ("incompatible dimensions for cross product\n"
           "(dimension must be 2 or 3)")
    if a.shape[-1] not in (2, 3) or b.shape[-1] not in (2, 3):
        raise ValueError(msg)

    # Create the output array
    shape = np.broadcast(a[..., 0], b[..., 0]).shape
    if a.shape[-1] == 3 or b.shape[-1] == 3:
        shape += (3,)
    # Check axisc is within bounds
    if axisc < -len(shape) or axisc >= len(shape):
        raise ValueError(axis_msg.format('c'))
    dtype = np.promote_types(a.dtype, b.dtype)
    cp = np.empty(shape, dtype)

    # create local aliases for readability
    a0 = a[..., 0]
    a1 = a[..., 1]
    if a.shape[-1] == 3:
        a2 = a[..., 2]
    b0 = b[..., 0]
    b1 = b[..., 1]
    if b.shape[-1] == 3:
        b2 = b[..., 2]
    if cp.ndim != 0 and cp.shape[-1] == 3:
        cp0 = cp[..., 0]
        cp1 = cp[..., 1]
        cp2 = cp[..., 2]

    if a.shape[-1] == 2:
        if b.shape[-1] == 2:
            # a0 * b1 - a1 * b0
            np.multiply(a0, b1, out=cp)
            cp -= a1 * b0
            return cp
        else:
            assert b.shape[-1] == 3
            # cp0 = a1 * b2 - 0  (a2 = 0)
            # cp1 = 0 - a0 * b2  (a2 = 0)
            # cp2 = a0 * b1 - a1 * b0
            np.multiply(a1, b2, out=cp0)
            np.multiply(a0, b2, out=cp1)
            np.negative(cp1, out=cp1)
            np.multiply(a0, b1, out=cp2)
            cp2 -= a1 * b0
    else:
        assert a.shape[-1] == 3
        if b.shape[-1] == 3:
            # cp0 = a1 * b2 - a2 * b1
            # cp1 = a2 * b0 - a0 * b2
            # cp2 = a0 * b1 - a1 * b0
            np.multiply(a1, b2, out=cp0)
            tmp = np.array(a2 * b1)
            cp0 -= tmp
            np.multiply(a2, b0, out=cp1)
            np.multiply(a0, b2, out=tmp)
            cp1 -= tmp
            np.multiply(a0, b1, out=cp2)
            np.multiply(a1, b0, out=tmp)
            cp2 -= tmp
        else:
            assert b.shape[-1] == 2
            # cp0 = 0 - a2 * b1  (b2 = 0)
            # cp1 = a2 * b0 - 0  (b2 = 0)
            # cp2 = a0 * b1 - a1 * b0
            np.multiply(a2, b1, out=cp0)
            np.negative(cp0, out=cp0)
            np.multiply(a2, b0, out=cp1)
            np.multiply(a0, b1, out=cp2)
            cp2 -= a1 * b0

    # This works because we are moving the last axis
    return np.rollaxis(cp, -1, axisc)


--
Bartc


Re: Numpy slow at vector cross product?

2016-11-22 Thread eryk sun
On Tue, Nov 22, 2016 at 1:06 PM, BartC  wrote:
>> In this specific example, the OP is comparing two radically different
>> pieces of code that clearly and obviously perform differently. He's doing
>> the equivalent of timing the code with his heartbeat, and getting 50 beats
>> for one and 150 beats for the other. That's good enough to show gross
>> differences in performance.
>
> No, he's using time.clock() with, presumably, a consistent number of ticks
> per second.

Note that some people in this discussion use Unix systems, on which
using time.clock for wall-clock timing is completely wrong. Please use
timeit.default_timer to ensure that examples are portable.
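For example (`timeit.default_timer` is `time.perf_counter` on Python 3.3+, and on older versions it picked the most precise wall-clock timer for the platform):

```python
from timeit import default_timer as timer

start = timer()
total = sum(i * i for i in range(100_000))
elapsed = timer() - start
print(total, elapsed)  # elapsed is wall-clock seconds on every platform
```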


Re: Numpy slow at vector cross product?

2016-11-22 Thread BartC

On 22/11/2016 03:00, Steve D'Aprano wrote:

On Tue, 22 Nov 2016 12:45 pm, BartC wrote:



You get to know after a while what kinds of processes affect timings. For
example, streaming a movie at the same time.


Really, no.



py> with Stopwatch():
... x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.007164 seconds


I tried 'Stopwatch()' and it didn't understand what it was.


And again:

py> with Stopwatch():
... x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.14 seconds


Look at the variation in the timing: 0.007164 versus 0.14 second. That's
the influence of a cache, or more than one cache, somewhere. But if I run
it again:

py> with Stopwatch():
... x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.13 seconds

there's a smaller variation, this time "only" 7%, for code which hasn't
changed. That's what you're up against.


I tried this:

import math

def fn():
for i in xrange(300):
x=math.sin(1.234)
print x

fn()

If I run this, my IDE tells me the whole thing took 0.93 seconds. That's 
including the overheads of the IDE, invoking python.exe, Python's 
start-up time, the function call, the loop overheads, and whatever is 
involved in shutting it all down again.


Now I replace the x=math.sin() line with pass. The IDE now says 0.21 
seconds. The only thing that's changed is the sin() call. I can probably 
deduce that executing x=math.sin(1.234) three million times took 0.72 
seconds (with some expected variation).
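That subtraction technique can be sketched directly, using `timeit.default_timer` rather than IDE wall time (`N` here is smaller than the three million above, and the usual caveat about noisy single runs applies):

```python
import math
from timeit import default_timer as timer

N = 300_000

start = timer()
for i in range(N):
    x = math.sin(1.234)
with_body = timer() - start

start = timer()
for i in range(N):
    pass                      # same loop, empty body
empty = timer() - start

# Deduce the per-call cost by subtracting the loop overhead
print((with_body - empty) / N)
```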


So over 4 million times per second. Which is interesting because your 
timings with Stopwatch varied from 140 to 71000 per second (and I doubt 
your machine is that much slower than mine).




[steve@ando ~]$ python2.7 -m timeit -s "x = 257" "3*x"
1000 loops, best of 3: 0.106 usec per loop
[steve@ando ~]$ python3.5 -m timeit -s "x = 257" "3*x"
1000 loops, best of 3: 0.137 usec per loop



That's *brilliant* and much simpler than anything you are doing with loops
and clocks and whatnot.


But you lose all context. And there is a limit to what you can do with 
such micro-benchmarks.


Sometimes you want to time a real task, but with one line substituted 
for another, or some other minor rearrangement.


Or perhaps the overall time of a code fragment depends on the mix of 
data it has to deal with.



Code will normally exist as a proper part of a module, not on the
command line, in a command history, or in a string, so why not test it
running inside a module?


Sure, you can do that, if you want potentially inaccurate results.


When your customer runs your application (and complains about how long 
it takes), they will be running the code as normal not inside a string 
or on the command line!



In this specific example, the OP is comparing two radically different pieces
of code that clearly and obviously perform differently. He's doing the
equivalent of timing the code with his heartbeat, and getting 50 beats for
one and 150 beats for the other. That's good enough to show gross
differences in performance.


No, he's using time.clock() with, presumably, a consistent number of 
ticks per second.



But often you're comparing two code snippets which are very nearly the same,
and trying to tease out a real difference of (say) 3% out of a noisy signal
where each run may differ by 10% just from randomness. Using your heartbeat
to time code is not going to do it.


Yes, exactly, that's why typing a code fragment on the command line is a 
waste of time. You need to design a test where the differences are going 
to be more obvious.


--
Bartc


Re: Numpy slow at vector cross product?

2016-11-22 Thread BartC

On 22/11/2016 12:34, Skip Montanaro wrote:

I'm simply suggesting there is plenty of room for improvement. I even

showed a version that did *exactly* what numpy does (AFAIK) that was three
times the speed of numpy even executed by CPython. So there is some mystery
there.

As I indicated in my earlier response, your version doesn't pass all of
numpy's cross product unit tests. Fix that and submit a patch to the numpy
maintainers. I suspect it would be accepted.


I saw your response but didn't understand it. My code was based around 
what Peter Otten posted from numpy sources.


I will have a look. Don't forget however that all someone is trying to 
do is to multiply two vectors. They're not interested in axes 
transformation or making them broadcastable, whatever that means.


So making numpy.cross do all that may simply be too big a cost.

--
bartc



Re: Numpy slow at vector cross product?

2016-11-22 Thread Skip Montanaro
> I'm simply suggesting there is plenty of room for improvement. I even
showed a version that did *exactly* what numpy does (AFAIK) that was three
times the speed of numpy even executed by CPython. So there is some mystery
there.

As I indicated in my earlier response, your version doesn't pass all of
numpy's cross product unit tests. Fix that and submit a patch to the numpy
maintainers. I suspect it would be accepted.

Skip


Re: Numpy slow at vector cross product?

2016-11-22 Thread BartC

On 22/11/2016 02:44, Steve D'Aprano wrote:

On Tue, 22 Nov 2016 05:43 am, BartC wrote:


The fastest I can get compiled, native code to do this is at 250 million
cross-products per second.


(Actually 300 million using 64-bit code.)


Yes, yes, you're awfully clever, and your secret private language is so much
more efficient than even C that the entire IT industry ought to hang their
head in shame.


The 250M/300M timing was using C and gcc-O3.

This gives an indication of the upper limit of what is possible.

The suggestion was that the numpy solution, which was *one thousand times 
slower* than these figures, wouldn't be any faster if written in C.


I'm simply suggesting there is plenty of room for improvement. I even 
showed a version that did *exactly* what numpy does (AFAIK) that was 
three times the speed of numpy even executed by CPython. So there is 
some mystery there.


(FWIW my own compiled language manages 70M, and my interpreted language 
up to 3M. Poor, so I might look at them again next time I have to do 
loads of vector-products.)



I'm only being *half* sarcastic here, for what it's worth. I remember the
days when I could fit an entire operating system, plus applications, on a
400K floppy disk, and they would run at acceptable speed on something like
an 8 MHz CPU. Code used to be more efficient, with less overhead. But given
that your magic compiler runs only on one person's PC in the entire world,
it is completely irrelevant.


So you're saying that numpy's speed is perfectly reasonable and we 
should do nothing about it? Because, after all, it does a few extra 
clever things (whether the programmer wants them or not!).


And the way it is written is completely beyond reproach:

msg = "incompatible dimensions for cross product\n"\
      "(dimension must be 2 or 3)"
if (a.shape[0] not in [2, 3]) or (b.shape[0] not in [2, 3]):
    raise ValueError(msg)

So it is fine for msg to be bound to a string value here (and having to 
unbind it again when it returns) even though it will never be used again 
in a normal call. (Or having to test both a.shape[0] and b.shape[0] for 
inclusion in a list, expensive looking operations, when they are tested 
again anyway in the next few lines.)
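Whether those membership tests are actually expensive is measurable. A quick, hedged micro-check (note that recent CPythons fold a literal list used with `in` into a constant tuple, so the two forms may time the same; numbers vary by machine):

```python
from timeit import timeit

# Micro-benchmark of the membership test questioned above
t_list  = timeit("x in [2, 3]", setup="x = 3", number=1_000_000)
t_tuple = timeit("x in (2, 3)", setup="x = 3", number=1_000_000)
print(t_list, t_tuple)
```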


--
Bartc


Re: Numpy slow at vector cross product?

2016-11-21 Thread Paul Rubin
Steven D'Aprano  writes:
> if we knew we should be doing it, and if we could be bothered to run
> multiple trials and gather statistics and keep a close eye on the
> deviation between measurements. But who wants to do that by hand?

You might like this, for Haskell:

   http://www.serpentine.com/criterion/tutorial.html

I've sometimes thought of wrapping it around other languages.

Make sure to click on the graphs: it's impressive work.


Re: Numpy slow at vector cross product?

2016-11-21 Thread Steven D'Aprano
On Tuesday 22 November 2016 14:00, Steve D'Aprano wrote:

> Running a whole lot of loops can, sometimes, mitigate some of that
> variation, but not always. Even when running in a loop, you can easily get
> variation of 10% or more just at random.

I think that needs to be emphasised: there's a lot of random noise in these 
measurements.

For big, heavyweight functions that do a lot of work, the noise is generally a 
tiny proportion, and you can safely ignore it. (At least for CPU-bound tasks; 
for I/O-bound tasks, the noise in I/O is potentially very high.)

For really tiny operations, the noise *may* be small, depending on the 
operation.  But small is not insignificant. Consider a simple operation like 
addition:

# Python 3.5
import statistics
from timeit import Timer
t = Timer("x + 1", setup="x = 0")
# ten trials, of one million loops each
results = t.repeat(repeat=10)
best = min(results)
average = statistics.mean(results)
std_error = statistics.stdev(results)/statistics.mean(results)


Best: 0.09761243686079979
Average: 0.0988507878035307
Std error: 0.02260956789268462

So this suggests that on my machine, doing no expensive virus scans or 
streaming video, the random noise in something as simple as integer addition is 
around two percent.

So that's your baseline: even simple operations repeated thousands of times 
will show random noise of a few percent.

Consequently, if you're doing one trial (one loop of, say, a million 
operations):

start = time.time()
for i in range(100):
x + 1
elapsed = time.time() - start


and compare the time taken with another trial, and the difference is of the 
order of a few percentage points, then you have *no* reason to believe the 
result is real. You ought to repeat your test multiple times -- the more the 
better.

timeit makes it easy to repeat your tests. It automatically picks the best 
timer for your platform and avoids serious gotchas from using the wrong timer. 
When called from the command line, it will automatically select the best number 
of loops to ensure reliable timing, without wasting time doing more loops than 
needed.

timeit isn't magic. It's not doing anything that you or I couldn't do by hand, 
if we knew we should be doing it, and if we could be bothered to run multiple 
trials and gather statistics and keep a close eye on the deviation between 
measurements. But who wants to do that by hand?



-- 
Steven
299792.458 km/s — not just a good idea, it’s the law!



Re: Numpy slow at vector cross product?

2016-11-21 Thread Steve D'Aprano
On Tue, 22 Nov 2016 12:45 pm, BartC wrote:

> On 21/11/2016 14:50, Steve D'Aprano wrote:
>> On Mon, 21 Nov 2016 11:09 pm, BartC wrote:
> 
>> Modern machines run multi-tasking operating systems, where there can be
>> other processes running. Depending on what you use as your timer, you may
>> be measuring the time that those other processes run. The OS can cache
>> frequently used pieces of code, which allows it to run faster. The CPU
>> itself will cache some code.
> 
> You get to know after a while what kinds of processes affect timings. For
> example, streaming a movie at the same time. 

Really, no.

You'll just have to take my word on this, but I'm not streaming any movies
at the moment. I don't even have a web browser running. And since I'm
running Linux, I don't have an anti-virus scanner that might have just
triggered a scan.

(But since I'm running Linux, I do have a web server, mail server, a DNS
server, cron, and about 300 other processes running, any of which might
start running for a microsecond or ten in the middle of a job.)

py> with Stopwatch():
... x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.007164 seconds


And again:

py> with Stopwatch():
... x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.14 seconds


Look at the variation in the timing: 0.007164 versus 0.14 second. That's
the influence of a cache, or more than one cache, somewhere. But if I run
it again:

py> with Stopwatch():
... x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.13 seconds

there's a smaller variation, this time "only" 7%, for code which hasn't
changed. That's what you're up against.

Running a whole lot of loops can, sometimes, mitigate some of that
variation, but not always. Even when running in a loop, you can easily get
variation of 10% or more just at random.


> So when you need to compare timings, you turn those off.
> 
>> The shorter the code snippet, the more these complications are relevant.
>> In this particular case, we can be reasonably sure that the time it takes
>> to create a list range(1) and the overhead of the loop is *probably*
>> quite a small percentage of the time it takes to perform 10 vector
>> multiplications. But that's not a safe assumption for all code snippets.
> 
> Yes, it was one of those crazy things that Python used to have to do,
> creating a list of N numbers just in order to be able to count to N.

Doesn't matter. Even with xrange, you're still counting the cost of looking
up xrange, passing one or more arguments to it, parsing those arguments,
creating an xrange object, and iterating over that xrange object
repeatedly. None of those things are free.

You might *hope* that the cost of those things are insignificant compared to
what you're actually interested in timing, but you don't know. And you're
resisting the idea of using a tool that is specifically designed to avoid
measuring all that overhead.

It's okay that your intuitions about the cost of executing Python code is
inaccurate. What's not okay is your refusal to listen to those who have a
better idea of what's involved.


[...]
>> The timeit module automates a bunch of tricky-to-write best practices for
>> timing code. Is that a problem?
> 
> The problem is it substitutes a bunch of tricky-to-get-right options and
> syntax which has to be typed /at the command line/. And you really don't
> want to have to write code at the command line (especially if sourced
> from elsewhere, which means you have to transcribe it).

You have to transcribe it no matter what you do. Unless you are given
correctly written timing code.

You don't have to use timeit from the command line. But you're mad if you
don't: the smaller the code snippet, the more convenient it is.

[steve@ando ~]$ python2.7 -m timeit -s "x = 257" "3*x"
1000 loops, best of 3: 0.106 usec per loop
[steve@ando ~]$ python3.5 -m timeit -s "x = 257" "3*x"
1000 loops, best of 3: 0.137 usec per loop


That's *brilliant* and much simpler than anything you are doing with loops
and clocks and whatnot. It's simple, straightforward, and tells me exactly
what I expected to see. (Python 3.6 will be even better.)

For the record, the reason Python 3.5 is so much slower here is because it
is a debugging build.


>> But if you prefer doing it "old school" from within Python, then:
>>
>> from timeit import Timer
>> t = Timer('np.cross(x, y)',  setup="""
>> import numpy as np
>> x = np.array([1, 2, 3])
>> y = np.array([4, 5, 6])
>> """)
>>
>> # take five measurements of 10 calls each, and report the fastest
>> result = min(t.repeat(number=10, repeat=5))/10
>> print(result)  # time in seconds per call
> 
>> Better?
> 
> A bit, but the code is now inside a string!

As oppo

Re: Numpy slow at vector cross product?

2016-11-21 Thread Steve D'Aprano
On Tue, 22 Nov 2016 05:43 am, BartC wrote:

> The fastest I can get compiled, native code to do this is at 250 million
> cross-products per second.


Yes, yes, you're awfully clever, and your secret private language is so much
more efficient than even C that the entire IT industry ought to hang their
head in shame.

I'm only being *half* sarcastic here, for what it's worth. I remember the
days when I could fit an entire operating system, plus applications, on a
400K floppy disk, and they would run at acceptable speed on something like
an 8 MHz CPU. Code used to be more efficient, with less overhead. But given
that your magic compiler runs only on one person's PC in the entire world,
it is completely irrelevant.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



Re: Numpy slow at vector cross product?

2016-11-21 Thread BartC

On 21/11/2016 14:50, Steve D'Aprano wrote:

On Mon, 21 Nov 2016 11:09 pm, BartC wrote:



Modern machines run multi-tasking operating systems, where there can be
other processes running. Depending on what you use as your timer, you may
be measuring the time that those other processes run. The OS can cache
frequently used pieces of code, which allows it to run faster. The CPU
itself will cache some code.


You get to know after a while what kinds of processes affect timings. For 
example, streaming a movie at the same time. So when you need to compare 
timings, you turn those off.



The shorter the code snippet, the more these complications are relevant. In
this particular case, we can be reasonably sure that the time it takes to
create a list range(1) and the overhead of the loop is *probably* quite
a small percentage of the time it takes to perform 10 vector
multiplications. But that's not a safe assumption for all code snippets.


Yes, it was one of those crazy things that Python used to have to do, 
creating a list of N numbers just in order to be able to count to N.


But that's not significant here. Either experience, or a preliminary 
test with an empty loop, or using xrange, or using Py3, will show that 
the loop overheads for N iterations in this case are small in comparison 
to executing the bodies of the loops.



This is why the timeit module exists: to do the right thing when it matters,
so that you don't have to think about whether or not it matters. The timeit
module works really really hard to get good quality, accurate timings,
minimizing any potential overhead.

The timeit module automates a bunch of tricky-to-write best practices for
timing code. Is that a problem?


The problem is it substitutes a bunch of tricky-to-get-right options and 
syntax which has to be typed /at the command line/. And you really don't 
want to have to write code at the command line (especially if sourced 
from elsewhere, which means you have to transcribe it).




But if you prefer doing it "old school" from within Python, then:

from timeit import Timer
t = Timer('np.cross(x, y)',  setup="""
import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
""")

# take five measurements of 10 calls each, and report the fastest
result = min(t.repeat(number=10, repeat=5))/10
print(result)  # time in seconds per call



Better?


A bit, but the code is now inside a string!

Code will normally exist as a proper part of a module, not on the 
command line, in a command history, or in a string, so why not test it 
running inside a module?


But I've done a lot of benchmarking and actually measuring execution 
time is just part of it. This test I ran from inside a function for 
example, not at module-level, as that is more typical.


Are the variables inside a timeit string globals or locals? It's just a 
lot of extra factors to worry about, and extra things to get wrong.


The loop timings used by the OP showed one took considerably longer than 
the other. And that was confirmed by others. There's nothing wrong with 
that method.


--
Bartc



Re: Numpy slow at vector cross product?

2016-11-21 Thread BartC

On 21/11/2016 17:04, Nobody wrote:

On Mon, 21 Nov 2016 14:53:35 +, BartC wrote:


Also that the critical bits were not implemented in Python?


That is correct. You'll notice that there aren't any loops in numpy.cross.
It's just a wrapper around a bunch of vectorised operations (*, -, []).

If you aren't taking advantage of vectorisation, there's no reason to
expect numpy to be any faster than primitive operations, any more than
you'd expect

(numpy.array([1]) + numpy.array([2]))[0]

to be faster than "1+2".

Beyond that, you'd expect a generic function to be at a disadvantage
compared to a function which makes assumptions about its arguments.
Given what it does, I wouldn't expect numpy.cross() to be faster for
individual vectors if it was written in C.


The fastest I can get compiled, native code to do this is at 250 million 
cross-products per second.


The fastest using pure Python executed with Python 2.7 is 0.5 million 
per second.


With pypy, around 8 million per second. (Results will vary by machine, 
version, and OS so this is just one set of timings.)


So numpy, at 0.03 million per second [on a different OS and different 
version], has room for improvement I think!


(In all cases, the loop has been hobbled so that one component 
increments per loop, and one component of the result is summed and then 
displayed at the end.


This is to stop gcc, and partly pypy, from optimising the code out of 
existence; usually you are not calculating the same vector product 
repeatedly. Without the restraint, pypy leaps to 100 million per second, 
and gcc to an infinite number.)


The tests were with values assumed to be vectors, assumed to have 3 
components, and without any messing about with axes, whatever that code 
does. It's just a pure, streamlined, vector cross product (just as I use 
in my own code).


Such a streamlined version can also be written in Python. (Although it 
would be better with a dedicated 3-component vector type rather than a 
general purpose list or even numpy array.)
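One shape such a dedicated 3-component type might take (purely illustrative; the class name and the `__slots__` choice are mine, not from the thread):

```python
class Vec3:
    __slots__ = ("x", "y", "z")   # fixed attribute slots: no per-instance dict

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def cross(self, other):
        # Plain textbook cross product: no axis handling, no broadcasting
        return Vec3(self.y * other.z - self.z * other.y,
                    self.z * other.x - self.x * other.z,
                    self.x * other.y - self.y * other.x)

v = Vec3(1, 2, 3).cross(Vec3(4, 5, 6))
print(v.x, v.y, v.z)  # -3 6 -3
```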


It's still a puzzle why directly executing the code that numpy uses was 
still faster than numpy itself, when both were run with CPython. Unless 
numpy is perhaps using extra wrappers around numpy.cross.


--
Bartc


Re: Numpy slow at vector cross product?

2016-11-21 Thread Nobody
On Mon, 21 Nov 2016 14:53:35 +, BartC wrote:

> Also that the critical bits were not implemented in Python?

That is correct. You'll notice that there aren't any loops in numpy.cross.
It's just a wrapper around a bunch of vectorised operations (*, -, []).

If you aren't taking advantage of vectorisation, there's no reason to
expect numpy to be any faster than primitive operations, any more than
you'd expect

(numpy.array([1]) + numpy.array([2]))[0]

to be faster than "1 + 2".
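A quick sketch of where the payoff actually is (sizes are illustrative): the per-call overhead is paid once for a whole array, not once per element.

```python
import numpy as np

a = np.arange(100000, dtype=float)
b = np.arange(100000, dtype=float)

# One vectorised call does 100000 multiplies in C; the Python-level
# overhead (argument parsing, dispatch) is incurred only once.
c = a * b

print(c[3])  # 9.0
```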

Beyond that, you'd expect a generic function to be at a disadvantage
compared to a function which makes assumptions about its arguments.
Given what it does, I wouldn't expect numpy.cross() to be faster for
individual vectors if it was written in C.



Re: Numpy slow at vector cross product?

2016-11-21 Thread Skip Montanaro
Perhaps your implementation isn't as general as numpy's? I pulled out
the TestCross class from numpy.core.tests.test_numeric and replaced
calls to np.cross with calls to your function. I got an error in
test_broadcasting_shapes:

ValueError: operands could not be broadcast together with shapes (1,2) (5,)

Skip


Re: Numpy slow at vector cross product?

2016-11-21 Thread BartC

On 21/11/2016 12:44, Peter Otten wrote:


After a look into the source this is no longer a big surprise (numpy 1.8.2):

if axis is not None:
    axisa, axisb, axisc = (axis,)*3
a = asarray(a).swapaxes(axisa, 0)
b = asarray(b).swapaxes(axisb, 0)




The situation may be different when you process vectors in bulk, i. e.
instead of [cross(a, bb) for bb in b] just say cross(a, b).


I thought that numpy was supposed to be fast? Also that the critical 
bits were not implemented in Python?


Anyway I tried your code (put into a function as shown below) in the 
same test program, and it was *still* 3 times as fast as numpy! 
(mycross() was 3 times as fast as np.cross().)


Explain that...


---

def mycross(a, b, axisa=-1, axisb=-1, axisc=-1, axis=None):
    if axis is not None:
        axisa, axisb, axisc = (axis,)*3
    a = np.asarray(a).swapaxes(axisa, 0)
    b = np.asarray(b).swapaxes(axisb, 0)
    msg = "incompatible dimensions for cross product\n"\
          "(dimension must be 2 or 3)"
    if (a.shape[0] not in [2, 3]) or (b.shape[0] not in [2, 3]):
        raise ValueError(msg)
    if a.shape[0] == 2:
        if (b.shape[0] == 2):
            cp = a[0]*b[1] - a[1]*b[0]
            if cp.ndim == 0:
                return cp
            else:
                return cp.swapaxes(0, axisc)
        else:
            x = a[1]*b[2]
            y = -a[0]*b[2]
            z = a[0]*b[1] - a[1]*b[0]
    elif a.shape[0] == 3:
        if (b.shape[0] == 3):
            x = a[1]*b[2] - a[2]*b[1]
            y = a[2]*b[0] - a[0]*b[2]
            z = a[0]*b[1] - a[1]*b[0]
        else:
            x = -a[2]*b[1]
            y = a[2]*b[0]
            z = a[0]*b[1] - a[1]*b[0]
    cp = np.array([x, y, z])
    if cp.ndim == 1:
        return cp
    else:
        return cp.swapaxes(0, axisc)

---
Tested as:

x = np.array([1,2,3])
y = np.array([4,5,6])

start = time.clock()
for i in range(loops):
    z = mycross(x,y)
print "Calc, %s loops: %.2g seconds" % (loops, time.clock()-start)


--
Bartc


Re: Numpy slow at vector cross product?

2016-11-21 Thread Steve D'Aprano
On Mon, 21 Nov 2016 11:09 pm, BartC wrote:

> On 21/11/2016 02:48, Steve D'Aprano wrote:
[...]
>> However, your code is not a great way of timing code. Timing code is
>> *very* difficult, and can be affected by many things, such as external
>> processes, CPU caches, even the function you use for getting the time.
>> Much of the time you are timing here will be in creating the range(loops)
>> list, especially if loops is big.
> 
> But both loops are the same size. And that overhead can quickly be
> disposed of by measuring empty loops in both cases. (On my machine,
> about 0.006/7 seconds for loops of 100,000.)

No, you cannot make that assumption, not in general. On modern machines, you
cannot assume that the time it takes to execute foo() immediately followed
by bar() is the same as the time it takes to execute foo() and bar()
separately.

Modern machines run multi-tasking operating systems, where there can be
other processes running. Depending on what you use as your timer, you may
be measuring the time that those other processes run. The OS can cache
frequently used pieces of code, which allows it to run faster. The CPU
itself will cache some code.

The shorter the code snippet, the more these complications are relevant. In
this particular case, we can be reasonably sure that the time it takes to
create a list range(100000) and the overhead of the loop is *probably* quite
a small percentage of the time it takes to perform 100,000 vector
multiplications. But that's not a safe assumption for all code snippets.

This is why the timeit module exists: to do the right thing when it matters,
so that you don't have to think about whether or not it matters. The timeit
module works really really hard to get good quality, accurate timings,
minimizing any potential overhead.

The timeit module automates a bunch of tricky-to-get-right best practices for
timing code. Is that a problem?



>> The best way to time small snippets of code is to use the timeit module.
>> Open a terminal or shell (*not* the Python interactive interpreter, the
>> operating system's shell: you should expect a $ or % prompt) and run
>> timeit from that. Copy and paste the following two commands into your
>> shell prompt:
>>
>>
>> python2.7 -m timeit --repeat 5 -s "import numpy as np" \
>> -s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
>> -- "np.cross(x, y)"
[...]

> Yes, I can see that typing all the code out again, and remembering all
> those options and putting -s, -- and \ in all the right places, is a
> much better way of doing it! Not error prone at all.

Gosh Bart, how did you manage to write that sentence? How did you remember
all those words, and remember to put the punctuation marks in the right
places?

You even used sarcasm! You must be a genius. (Oh look, I can use it too.)

Seriously Bart? You've been a programmer for how many decades, and you can't
work out how to call a command from the shell? This is about working
effectively with your tools, and a basic understanding of the shell is an
essential tool for programmers.

This was a *simple* command. It was a LONG command, but don't be fooled by
the length, and the fact that it went over multiple lines, it was dirt
simple. I'm not saying that every programmer needs to be a greybeard Unix
guru (heaven knows that I'm not!), but they ought to be able to run simple
commands from the command line.

Those who don't are in the same position as carpenters who don't know the
differences between the various kinds of hammer or saws. Sure, you can
still do a lot of work using just one kind of hammer and one kind of saw,
but you'll work better, faster and smarter with the right kind. You don't
use a rip saw to make fine cuts, and you don't use a mash hammer to drive
tacks.

The -m option lets you run a module without knowing the precise location of
the source file. Some of the most useful commands to learn:

python -m unittest ...
python -m doctest ...
python -m timeit ...

The biggest advantage of calling the timeit module is that it will
automatically select the number of iterations you need to run to get good
timing results, without wasting time running excessive loops.

(The timeit module is *significantly* improved in Python 3.6, but even in
older versions it's pretty good.)

But if you prefer doing it "old school" from within Python, then:

from timeit import Timer
t = Timer('np.cross(x, y)',  setup="""
import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
""")

# take five measurements of 10 calls each, and report the fastest
result = min(t.repeat(number=10, repeat=5))/10
print(result)  # time in seconds per call



Better?




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



Re: Numpy slow at vector cross product?

2016-11-21 Thread eryk sun
On Mon, Nov 21, 2016 at 1:38 AM, BartC  wrote:
> On 20/11/2016 20:46, DFS wrote:
>>
>> import sys, time, numpy as np
>> loops=int(sys.argv[1])
>>
>> x=np.array([1,2,3])
>> y=np.array([4,5,6])
>> start=time.clock()

In Unix, time.clock doesn't measure wall-clock time, but rather an
approximation to the CPU time used by the current process. On the
other hand, time.time calls gettimeofday, if available, which has a
resolution of 1 microsecond. Python 2 timing tests should use
time.time on Unix.

In Windows, time.time calls GetSystemTimeAsFileTime, which has a
default resolution of only about 15 ms, adjustable down to about 1 ms.
On other hand, time.clock calls QueryPerformanceCounter, which has a
resolution of about 100 nanoseconds. Python 2 timing tests should use
time.clock on Windows.

In Python 3.3+, timing tests should use time.perf_counter. In Linux
this calls clock_gettime using a monotonic clock with a resolution of
1 nanosecond, and in Windows it calls QueryPerformanceCounter.

In any case, timeit.default_timer selects the best function to call
for a given platform.
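A portable pattern, for what it's worth (the workload here is just a stand-in):

```python
import timeit

# timeit.default_timer picks the best clock for the platform:
# time.perf_counter on Python 3.3+; under Python 2 it was time.clock
# on Windows and time.time on Unix.
start = timeit.default_timer()
total = sum(i * i for i in range(100000))
elapsed = timeit.default_timer() - start

print(total)
print(elapsed >= 0.0)  # True
```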


Re: Numpy slow at vector cross product?

2016-11-21 Thread Peter Otten
Steve D'Aprano wrote:

> On Mon, 21 Nov 2016 07:46 am, DFS wrote:
> 
>> import sys, time, numpy as np
>> loops=int(sys.argv[1])
>> 
>> x=np.array([1,2,3])
>> y=np.array([4,5,6])
>> start=time.clock()
>> for i in range(loops):
>>  np.cross(x,y)
>> print "Numpy, %s loops: %.2g seconds" %(loops,time.clock()-start)
> 
> [...]
>> $ python vector_cross.py
>> Numpy, 100000 loops: 2.5 seconds
>> Calc,  100000 loops: 0.13 seconds
>> 
>> 
>> Did I do something wrong, or is numpy slow at this?
> 
> I can confirm similar results.
> 
> However, your code is not a great way of timing code. Timing code is
> *very* difficult, and can be affected by many things, such as external
> processes, CPU caches, even the function you use for getting the time.
> Much of the time you are timing here will be in creating the range(loops)
> list, especially if loops is big.
> 
> The best way to time small snippets of code is to use the timeit module.
> Open a terminal or shell (*not* the Python interactive interpreter, the
> operating system's shell: you should expect a $ or % prompt) and run
> timeit from that. Copy and paste the following two commands into your
> shell prompt:
> 
> 
> python2.7 -m timeit --repeat 5 -s "import numpy as np" \
> -s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
> -- "np.cross(x, y)"
> 
> 
> python2.7 -m timeit --repeat 5 -s "x = [1, 2, 3]" \
> -s "y = [4, 5, 6]" -s "z = [0, 0, 0]" \
> -- "z[0] = x[1]*y[2] - x[2]*y[1]; z[1] = x[2]*y[0] - \
> x[0]*y[2]; z[2] = x[0]*y[1] - x[1]*y[0]"
> 
> 
> The results I get are:
> 
> 10000 loops, best of 5: 30 usec per loop
> 
> 1000000 loops, best of 5: 1.23 usec per loop
> 
> 
> So on my machine, np.cross() is about 25 times slower than multiplying by
> hand.

After a look into the source this is no longer a big surprise (numpy 1.8.2):

if axis is not None:
    axisa, axisb, axisc = (axis,)*3
a = asarray(a).swapaxes(axisa, 0)
b = asarray(b).swapaxes(axisb, 0)
msg = "incompatible dimensions for cross product\n"\
      "(dimension must be 2 or 3)"
if (a.shape[0] not in [2, 3]) or (b.shape[0] not in [2, 3]):
    raise ValueError(msg)
if a.shape[0] == 2:
    if (b.shape[0] == 2):
        cp = a[0]*b[1] - a[1]*b[0]
        if cp.ndim == 0:
            return cp
        else:
            return cp.swapaxes(0, axisc)
    else:
        x = a[1]*b[2]
        y = -a[0]*b[2]
        z = a[0]*b[1] - a[1]*b[0]
elif a.shape[0] == 3:
    if (b.shape[0] == 3):
        x = a[1]*b[2] - a[2]*b[1]
        y = a[2]*b[0] - a[0]*b[2]
        z = a[0]*b[1] - a[1]*b[0]
    else:
        x = -a[2]*b[1]
        y = a[2]*b[0]
        z = a[0]*b[1] - a[1]*b[0]
cp = array([x, y, z])
if cp.ndim == 1:
    return cp
else:
    return cp.swapaxes(0, axisc)

The situation may be different when you process vectors in bulk, i. e. 
instead of [cross(a, bb) for bb in b] just say cross(a, b).
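For instance, stacking the vectors and making one call (a small illustrative sketch; shapes assumed (N, 3)):

```python
import numpy as np

a = np.array([[1, 2, 3]] * 4)   # (4, 3): four copies of one vector
b = np.array([[4, 5, 6]] * 4)   # (4, 3)

# One np.cross call handles every row at once, so the axis-juggling
# overhead is paid a single time rather than once per vector.
result = np.cross(a, b)

print(result.shape)         # (4, 3)
print(result[0].tolist())   # [-3, 6, -3]
```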




Re: Numpy slow at vector cross product?

2016-11-21 Thread BartC

On 21/11/2016 02:48, Steve D'Aprano wrote:

On Mon, 21 Nov 2016 07:46 am, DFS wrote:



start=time.clock()
for i in range(loops):
    np.cross(x,y)
print "Numpy, %s loops: %.2g seconds" %(loops,time.clock()-start)



However, your code is not a great way of timing code. Timing code is *very*
difficult, and can be affected by many things, such as external processes,
CPU caches, even the function you use for getting the time. Much of the
time you are timing here will be in creating the range(loops) list,
especially if loops is big.


But both loops are the same size. And that overhead can quickly be 
disposed of by measuring empty loops in both cases. (On my machine, 
about 0.006/7 seconds for loops of 100,000.)



The best way to time small snippets of code is to use the timeit module.
Open a terminal or shell (*not* the Python interactive interpreter, the
operating system's shell: you should expect a $ or % prompt) and run timeit
from that. Copy and paste the following two commands into your shell
prompt:


python2.7 -m timeit --repeat 5 -s "import numpy as np" \
-s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
-- "np.cross(x, y)"


python2.7 -m timeit --repeat 5 -s "x = [1, 2, 3]" \
-s "y = [4, 5, 6]" -s "z = [0, 0, 0]" \
-- "z[0] = x[1]*y[2] - x[2]*y[1]; z[1] = x[2]*y[0] - \
x[0]*y[2]; z[2] = x[0]*y[1] - x[1]*y[0]"


Yes, I can see that typing all the code out again, and remembering all 
those options and putting -s, -- and \ in all the right places, is a 
much better way of doing it! Not error prone at all.


--
bartc


Re: Numpy slow at vector cross product?

2016-11-20 Thread Steve D'Aprano
On Mon, 21 Nov 2016 07:46 am, DFS wrote:

> import sys, time, numpy as np
> loops=int(sys.argv[1])
> 
> x=np.array([1,2,3])
> y=np.array([4,5,6])
> start=time.clock()
> for i in range(loops):
>  np.cross(x,y)
> print "Numpy, %s loops: %.2g seconds" %(loops,time.clock()-start)

[...]
> $ python vector_cross.py
> Numpy, 100000 loops: 2.5 seconds
> Calc,  100000 loops: 0.13 seconds
> 
> 
> Did I do something wrong, or is numpy slow at this?

I can confirm similar results.

However, your code is not a great way of timing code. Timing code is *very*
difficult, and can be affected by many things, such as external processes,
CPU caches, even the function you use for getting the time. Much of the
time you are timing here will be in creating the range(loops) list,
especially if loops is big.

The best way to time small snippets of code is to use the timeit module.
Open a terminal or shell (*not* the Python interactive interpreter, the
operating system's shell: you should expect a $ or % prompt) and run timeit
from that. Copy and paste the following two commands into your shell
prompt:


python2.7 -m timeit --repeat 5 -s "import numpy as np" \
-s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
-- "np.cross(x, y)"


python2.7 -m timeit --repeat 5 -s "x = [1, 2, 3]" \
-s "y = [4, 5, 6]" -s "z = [0, 0, 0]" \
-- "z[0] = x[1]*y[2] - x[2]*y[1]; z[1] = x[2]*y[0] - \
x[0]*y[2]; z[2] = x[0]*y[1] - x[1]*y[0]"


The results I get are:

10000 loops, best of 5: 30 usec per loop

1000000 loops, best of 5: 1.23 usec per loop


So on my machine, np.cross() is about 25 times slower than multiplying by
hand.





-- 
Steve



Re: Numpy slow at vector cross product?

2016-11-20 Thread BartC

On 20/11/2016 20:46, DFS wrote:

import sys, time, numpy as np
loops = int(sys.argv[1])

x = np.array([1,2,3])
y = np.array([4,5,6])
start = time.clock()
for i in range(loops):
    np.cross(x,y)
print "Numpy, %s loops: %.2g seconds" % (loops, time.clock()-start)

x = [1,2,3]
y = [4,5,6]
z = [0,0,0]
start = time.clock()
for i in range(loops):
    z[0] = x[1]*y[2] - x[2]*y[1]
    z[1] = x[2]*y[0] - x[0]*y[2]
    z[2] = x[0]*y[1] - x[1]*y[0]
print "Calc, %s loops: %.2g seconds" % (loops, time.clock()-start)


I don't know why numpy is slow, but I can confirm similar results.

In fact if the workings are put into a function, then the difference is 
even more marked, with the normal calculation being 50% faster.


Maybe numpy has extra overheads, and the arrays being operated on are 
very small, but even so, 30 times slower than CPython? (2.5 to 0.083 
seconds.)


--
Bartc



