Re: Numpy slow at vector cross product?
On 22/11/2016 16:48, Steve D'Aprano wrote:
> On Tue, 22 Nov 2016 11:45 pm, BartC wrote:
>> I will have a look. Don't forget however that all someone is trying to
>> do is to multiply two vectors. They're not interested in axes
>> transformation or making them broadcastable, whatever that means.
>
> You don't know that.

When I did vector arithmetic at school, the formula for the cross product
of (x1,y1,z1) with (x2,y2,z2) was quite straightforward. That is what you
expect a library function to do, especially in an implementation that
could be crippled by running as unoptimised byte-code. If someone were
given the task of cross-multiplying two three-element vectors, the
formula in the text-book (or Wikipedia) is what they would write.

> Could numpy optimize the single pair of vectors case a bit better?
> Perhaps they could. It looks like there's a bunch of minor improvements
> which could be made to the code, a few micro-optimizations that shave a
> microsecond or two off the execution time. And maybe they could even
> detect the case where the arguments are a single pair of vectors, and
> optimize that. Even replacing it with a naive pure-Python cross product
> would be a big win.

The code it ends up executing for individual pairs of vectors is a joke:
97% of what it does has nothing to do with the cross product!

> But for the big array of vectors case, you absolutely have to support
> doing fast vectorized cross-products over a huge number of vectors.

That's how I expected numpy to work. I said something along those lines
in my first post:

BC:
> Maybe numpy has extra overheads, and the arrays being operated on are
> very small, but even so, 30 times slower than CPython? (2.5 to 0.083
> seconds.)

> py> a = np.array([
> ...     [1, 2, 3],
> ...     [4, 5, 6],
> ...     [7, 8, 9],
> ...     ]*1000
> ... )
> py> b = np.array([
> ...     [9, 8, 7],
> ...     [6, 5, 4],
> ...     [3, 2, 1],
> ...     ]*1000
> ... )
> py> a.shape
> (3000, 3)
> py> result = np.cross(a, b)
> py> result.shape
> (3000, 3)
>
> On my computer, numpy took only 10 times longer to cross-multiply 3000
> pairs of vectors than it took to cross-multiply a single pair of
> vectors. If I did that in pure Python, it would take 3000 times longer,
> or more, so numpy wins here by a factor of 300.

I tested this with 3 million pairs on my machine. It managed 8 million
products per second (about the same as the simplest Python code, but run
with pypy), taking nearly 0.4 seconds. Except it took over 4 seconds to
create the special numpy arrays! Setting up ordinary arrays was much
faster, but then the cross-product calculation was slower.

--
bartc
--
https://mail.python.org/mailman/listinfo/python-list
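For reference, the "text-book formula" both posters have in mind is just
three multiply-subtract pairs; a minimal pure-Python sketch (the
schoolbook definition, not numpy's implementation):

def cross3(a, b):
    """Schoolbook cross product of two 3-element sequences."""
    return [a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]]

print(cross3([1, 2, 3], [4, 5, 6]))   # [-3, 6, -3]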
Re: Numpy slow at vector cross product?
On Tue, 22 Nov 2016 11:45 pm, BartC wrote:

> I will have a look. Don't forget however that all someone is trying to
> do is to multiply two vectors. They're not interested in axes
> transformation or making them broadcastable, whatever that means.

You don't know that. Bart, you have a rather disagreeable tendency towards
assuming that if you don't want to do something, then nobody else could
possibly want to do it. You should try to keep an open mind to the
possibility that perhaps there are use-cases that you didn't think of.
numpy is not *the* most heavily used third-party library in the Python
ecosystem because it's slow.

numpy is a library for doing vectorised operations over massive arrays.
You're thinking of numpy users doing the cross product of two vectors,
but you should be thinking of numpy users doing the cross product of a
million pairs of vectors.

Could numpy optimize the single pair of vectors case a bit better? Perhaps
they could. It looks like there's a bunch of minor improvements which
could be made to the code, a few micro-optimizations that shave a
microsecond or two off the execution time. And maybe they could even
detect the case where the arguments are a single pair of vectors, and
optimize that. Even replacing it with a naive pure-Python cross product
would be a big win.

But for the big array of vectors case, you absolutely have to support
doing fast vectorized cross-products over a huge number of vectors.

py> a = np.array([
...     [1, 2, 3],
...     [4, 5, 6],
...     [7, 8, 9],
...     ]*1000
... )
py> b = np.array([
...     [9, 8, 7],
...     [6, 5, 4],
...     [3, 2, 1],
...     ]*1000
... )
py> a.shape
(3000, 3)
py> result = np.cross(a, b)
py> result.shape
(3000, 3)

On my computer, numpy took only 10 times longer to cross-multiply 3000
pairs of vectors than it took to cross-multiply a single pair of vectors.
If I did that in pure Python, it would take 3000 times longer, or more,
so numpy wins here by a factor of 300.

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On 22/11/2016 12:45, BartC wrote:
> On 22/11/2016 12:34, Skip Montanaro wrote:
>>> I'm simply suggesting there is plenty of room for improvement. I even
>>> showed a version that did *exactly* what numpy does (AFAIK) that was
>>> three times the speed of numpy even executed by CPython. So there is
>>> some mystery there.
>>
>> As I indicated in my earlier response, your version doesn't pass all of
>> numpy's cross product unit tests. Fix that and submit a patch to the
>> numpy maintainers. I suspect it would be accepted.
>
> I saw your response but didn't understand it. My code was based around
> what Peter Otten posted from numpy sources.
>
> I will have a look. Don't forget however that all someone is trying to
> do is to multiply two vectors. They're not interested in axes
> transformation or making them broadcastable, whatever that means.

It seems the posted code of numpy.cross wasn't complete. The full code is
below (I've removed the doc string, but have had to add some 'np.'
prefixes as it is now out of context). If anyone is still wondering why
numpy.cross is slow, then no further comments are necessary!

---
def cross(a, b, axisa=-1, axisb=-1, axisc=-1, axis=None):
    if axis is not None:
        axisa, axisb, axisc = (axis,) * 3
    a = np.asarray(a)
    b = np.asarray(b)
    # Check axisa and axisb are within bounds
    axis_msg = "'axis{0}' out of bounds"
    if axisa < -a.ndim or axisa >= a.ndim:
        raise ValueError(axis_msg.format('a'))
    if axisb < -b.ndim or axisb >= b.ndim:
        raise ValueError(axis_msg.format('b'))
    # Move working axis to the end of the shape
    a = np.rollaxis(a, axisa, a.ndim)
    b = np.rollaxis(b, axisb, b.ndim)
    msg = ("incompatible dimensions for cross product\n"
           "(dimension must be 2 or 3)")
    if a.shape[-1] not in (2, 3) or b.shape[-1] not in (2, 3):
        raise ValueError(msg)

    # Create the output array
    shape = np.broadcast(a[..., 0], b[..., 0]).shape
    if a.shape[-1] == 3 or b.shape[-1] == 3:
        shape += (3,)
        # Check axisc is within bounds
        if axisc < -len(shape) or axisc >= len(shape):
            raise ValueError(axis_msg.format('c'))
    dtype = np.promote_types(a.dtype, b.dtype)
    cp = np.empty(shape, dtype)

    # create local aliases for readability
    a0 = a[..., 0]
    a1 = a[..., 1]
    if a.shape[-1] == 3:
        a2 = a[..., 2]
    b0 = b[..., 0]
    b1 = b[..., 1]
    if b.shape[-1] == 3:
        b2 = b[..., 2]
    if cp.ndim != 0 and cp.shape[-1] == 3:
        cp0 = cp[..., 0]
        cp1 = cp[..., 1]
        cp2 = cp[..., 2]

    if a.shape[-1] == 2:
        if b.shape[-1] == 2:
            # a0 * b1 - a1 * b0
            np.multiply(a0, b1, out=cp)
            cp -= a1 * b0
            return cp
        else:
            assert b.shape[-1] == 3
            # cp0 = a1 * b2 - 0  (a2 = 0)
            # cp1 = 0 - a0 * b2  (a2 = 0)
            # cp2 = a0 * b1 - a1 * b0
            np.multiply(a1, b2, out=cp0)
            np.multiply(a0, b2, out=cp1)
            np.negative(cp1, out=cp1)
            np.multiply(a0, b1, out=cp2)
            cp2 -= a1 * b0
    else:
        assert a.shape[-1] == 3
        if b.shape[-1] == 3:
            # cp0 = a1 * b2 - a2 * b1
            # cp1 = a2 * b0 - a0 * b2
            # cp2 = a0 * b1 - a1 * b0
            np.multiply(a1, b2, out=cp0)
            tmp = np.array(a2 * b1)
            cp0 -= tmp
            np.multiply(a2, b0, out=cp1)
            np.multiply(a0, b2, out=tmp)
            cp1 -= tmp
            np.multiply(a0, b1, out=cp2)
            np.multiply(a1, b0, out=tmp)
            cp2 -= tmp
        else:
            assert b.shape[-1] == 2
            # cp0 = 0 - a2 * b1  (b2 = 0)
            # cp1 = a2 * b0 - 0  (b2 = 0)
            # cp2 = a0 * b1 - a1 * b0
            np.multiply(a2, b1, out=cp0)
            np.negative(cp0, out=cp0)
            np.multiply(a2, b0, out=cp1)
            np.multiply(a0, b1, out=cp2)
            cp2 -= a1 * b0

    # This works because we are moving the last axis
    return np.rollaxis(cp, -1, axisc)
---

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list
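As a quick sanity check of the three branches above, the expected values
below are worked out by hand from the commented formulas (they are not
taken from the thread):

import numpy as np
print(np.cross([1, 2], [3, 4]))        # 2-D x 2-D -> scalar: 1*4 - 2*3 = -2
print(np.cross([1, 2], [3, 4, 5]))     # 2-D x 3-D -> [10, -5, -2]
print(np.cross([1, 2, 3], [4, 5, 6]))  # 3-D x 3-D -> [-3, 6, -3]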
Re: Numpy slow at vector cross product?
On Tue, Nov 22, 2016 at 1:06 PM, BartC wrote:
>> In this specific example, the OP is comparing two radically different
>> pieces of code that clearly and obviously perform differently. He's
>> doing the equivalent of timing the code with his heartbeat, and getting
>> 50 beats for one and 150 beats for the other. That's good enough to
>> show gross differences in performance.
>
> No, he's using time.clock() with, presumably, a consistent number of
> ticks per second.

Note that some people in this discussion use Unix systems, on which using
time.clock for wall-clock timing is completely wrong. Please use
timeit.default_timer to ensure that examples are portable.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On 22/11/2016 03:00, Steve D'Aprano wrote:
> On Tue, 22 Nov 2016 12:45 pm, BartC wrote:
>> You get to know after a while what kinds of processes affect timings.
>> For example, streaming a movie at the same time.
>
> Really, no.
>
> py> with Stopwatch():
> ...     x = math.sin(1.234)
> ...
> elapsed time is very small; consider using the timeit module for
> micro-timings of small code snippets
> time taken: 0.007164 seconds

I tried 'Stopwatch()' and it didn't understand what it was.

> And again:
>
> py> with Stopwatch():
> ...     x = math.sin(1.234)
> ...
> elapsed time is very small; consider using the timeit module for
> micro-timings of small code snippets
> time taken: 0.14 seconds
>
> Look at the variation in the timing: 0.007164 versus 0.14 seconds.
> That's the influence of a cache, or more than one cache, somewhere. But
> if I run it again:
>
> py> with Stopwatch():
> ...     x = math.sin(1.234)
> ...
> elapsed time is very small; consider using the timeit module for
> micro-timings of small code snippets
> time taken: 0.13 seconds
>
> there's a smaller variation, this time "only" 7%, for code which hasn't
> changed. That's what you're up against.

I tried this:

import math
def fn():
    for i in xrange(3000000):
        x = math.sin(1.234)
    print x
fn()

If I run this, my IDE tells me the whole thing took 0.93 seconds. That's
including the overheads of the IDE, invoking python.exe, Python's
start-up time, the function call, the loop overheads, and whatever is
involved in shutting it all down again.

Now I replace the x=math.sin() line with pass. The IDE now says 0.21
seconds. The only thing that's changed is the sin() call. I can probably
deduce that executing x=math.sin(1.234) three million times took 0.72
seconds (with some expected variation). So over 4 million times per
second. Which is interesting, because your timings with Stopwatch varied
from 140 down to 7 per second (and I doubt your machine is that much
slower than mine).

> [steve@ando ~]$ python2.7 -m timeit -s "x = 257" "3*x"
> 10000000 loops, best of 3: 0.106 usec per loop
> [steve@ando ~]$ python3.5 -m timeit -s "x = 257" "3*x"
> 10000000 loops, best of 3: 0.137 usec per loop
>
> That's *brilliant* and much simpler than anything you are doing with
> loops and clocks and whatnot.

But you lose all context. And there is a limit to what you can do with
such micro-benchmarks. Sometimes you want to time a real task, but with
one line substituted for another, or some other minor rearrangement. Or
perhaps the overall time of a code fragment depends on the mix of data it
has to deal with.

Code will normally exist as a proper part of a module, not on the command
line, in a command history, or in a string, so why not test it running
inside a module?

> Sure, you can do that, if you want potentially inaccurate results.

When your customer runs your application (and complains about how long it
takes), they will be running the code as normal, not inside a string or
on the command line!

> In this specific example, the OP is comparing two radically different
> pieces of code that clearly and obviously perform differently. He's
> doing the equivalent of timing the code with his heartbeat, and getting
> 50 beats for one and 150 beats for the other. That's good enough to
> show gross differences in performance.

No, he's using time.clock() with, presumably, a consistent number of
ticks per second.

> But often you're comparing two code snippets which are very nearly the
> same, and trying to tease out a real difference of (say) 3% out of a
> noisy signal where each run may differ by 10% just from randomness.
> Using your heartbeat to time code is not going to do it.

Yes, exactly, that's why typing a code fragment on the command line is a
waste of time. You need to design a test where the differences are going
to be more obvious.

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list
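A sketch of the empty-loop subtraction Bart describes, made explicit
(Python 2 style to match the thread's xrange/print; the loop count and
timings are illustrative):

import math
from timeit import default_timer

N = 3000000

def with_sin():
    for i in xrange(N):
        x = math.sin(1.234)

def empty_loop():
    for i in xrange(N):
        pass

t0 = default_timer()
with_sin()
t1 = default_timer()
empty_loop()
t2 = default_timer()
# subtract the bare loop overhead from the loop-with-body time
net = (t1 - t0) - (t2 - t1)
print "net cost of %d sin() calls: %.2f seconds" % (N, net)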
Re: Numpy slow at vector cross product?
On 22/11/2016 12:34, Skip Montanaro wrote:
>> I'm simply suggesting there is plenty of room for improvement. I even
>> showed a version that did *exactly* what numpy does (AFAIK) that was
>> three times the speed of numpy even executed by CPython. So there is
>> some mystery there.
>
> As I indicated in my earlier response, your version doesn't pass all of
> numpy's cross product unit tests. Fix that and submit a patch to the
> numpy maintainers. I suspect it would be accepted.

I saw your response but didn't understand it. My code was based around
what Peter Otten posted from numpy sources.

I will have a look. Don't forget however that all someone is trying to do
is to multiply two vectors. They're not interested in axes transformation
or making them broadcastable, whatever that means. So making numpy.cross
do all that may simply be too big a cost.

--
bartc
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
> I'm simply suggesting there is plenty of room for improvement. I even
> showed a version that did *exactly* what numpy does (AFAIK) that was
> three times the speed of numpy even executed by CPython. So there is
> some mystery there.

As I indicated in my earlier response, your version doesn't pass all of
numpy's cross product unit tests. Fix that and submit a patch to the
numpy maintainers. I suspect it would be accepted.

Skip
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On 22/11/2016 02:44, Steve D'Aprano wrote:
> On Tue, 22 Nov 2016 05:43 am, BartC wrote:
>> The fastest I can get compiled, native code to do this is at 250
>> million cross-products per second.
>
> Yes, yes, you're awfully clever, and your secret private language is so
> much more efficient than even C that the entire IT industry ought to
> hang their head in shame.

(Actually 300 million using 64-bit code.) The 250M/300M timing was using
C and gcc -O3. This gives an indication of the upper limit of what is
possible. The suggestion was that the numpy solution, which was *one
thousand times slower* than these figures, wouldn't be any faster if
written in C.

I'm simply suggesting there is plenty of room for improvement. I even
showed a version that did *exactly* what numpy does (AFAIK) that was
three times the speed of numpy even executed by CPython. So there is some
mystery there.

(FWIW my own compiled language manages 70M, and my interpreted language
up to 3M. Poor, so I might look at them again next time I have to do
loads of vector products.)

> I'm only being *half* sarcastic here, for what it's worth. I remember
> the days when I could fit an entire operating system, plus
> applications, on a 400K floppy disk, and they would run at acceptable
> speed on something like an 8 MHz CPU. Code used to be more efficient,
> with less overhead. But given that your magic compiler runs only on one
> person's PC in the entire world, it is completely irrelevant.

So you're saying that numpy's speed is perfectly reasonable and we should
do nothing about it? Because, after all, it does a few extra clever
things (whether the programmer wants them or not!). And the way it is
written is completely beyond reproach:

    msg = "incompatible dimensions for cross product\n"\
          "(dimension must be 2 or 3)"
    if (a.shape[0] not in [2, 3]) or (b.shape[0] not in [2, 3]):
        raise ValueError(msg)

So it is fine for msg to be bound to a string value here (and having to
unbind it again when it returns) even though it will never be used again
in a normal call. (Or having to test both a.shape[0] and b.shape[0] for
inclusion in a list, expensive-looking operations, when they are tested
again anyway in the next few lines.)

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list
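For illustration, the two micro-optimisations Bart is gesturing at might
look like this (an unbenchmarked sketch, not a numpy patch): construct
the message only on the error path, and test membership against a tuple
rather than building a list (some CPython versions already fold this):

def check_dims(a, b):
    # only pay for string construction when the check actually fails
    if a.shape[0] not in (2, 3) or b.shape[0] not in (2, 3):
        raise ValueError("incompatible dimensions for cross product\n"
                         "(dimension must be 2 or 3)")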
Re: Numpy slow at vector cross product?
Steven D'Aprano writes:
> if we knew we should be doing it, and if we could be bothered to run
> multiple trials and gather statistics and keep a close eye on the
> deviation between measurements. But who wants to do that by hand?

You might like this, for Haskell:

http://www.serpentine.com/criterion/tutorial.html

I've sometimes thought of wrapping it around other languages. Make sure
to click on the graphs: it's impressive work.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On Tuesday 22 November 2016 14:00, Steve D'Aprano wrote:

> Running a whole lot of loops can, sometimes, mitigate some of that
> variation, but not always. Even when running in a loop, you can easily
> get variation of 10% or more just at random.

I think that needs to be emphasised: there's a lot of random noise in
these measurements.

For big, heavyweight functions that do a lot of work, the noise is
generally a tiny proportion, and you can safely ignore it. (At least for
CPU bound tasks: for I/O bound tasks, the noise in I/O is potentially
very high.)

For really tiny operations, the noise *may* be small, depending on the
operation. But small is not insignificant. Consider a simple operation
like addition:

# Python 3.5
import statistics
from timeit import Timer

t = Timer("x + 1", setup="x = 0")
# ten trials, of one million loops each
results = t.repeat(repeat=10)
best = min(results)
average = statistics.mean(results)
std_error = statistics.stdev(results)/statistics.mean(results)
print("Best:", best)
print("Average:", average)
print("Std error:", std_error)

Best: 0.09761243686079979
Average: 0.0988507878035307
Std error: 0.02260956789268462

So this suggests that on my machine, doing no expensive virus scans or
streaming video, the random noise in something as simple as integer
addition is around two percent.

So that's your baseline: even simple operations repeated thousands of
times will show random noise of a few percent. Consequently, if you're
doing one trial (one loop of, say, a million operations):

start = time.time()
for i in range(1000000):
    x + 1
elapsed = time.time() - start

and compare the time taken with another trial, and the difference is of
the order of a few percentage points, then you have *no* reason to
believe the result is real. You ought to repeat your test multiple times
-- the more the better.

timeit makes it easy to repeat your tests. It automatically picks the
best timer for your platform and avoids serious gotchas from using the
wrong timer. When called from the command line, it will automatically
select the best number of loops to ensure reliable timing, without
wasting time doing more loops than needed.

timeit isn't magic. It's not doing anything that you or I couldn't do by
hand, if we knew we should be doing it, and if we could be bothered to
run multiple trials and gather statistics and keep a close eye on the
deviation between measurements. But who wants to do that by hand?

--
Steven
299792.458 km/s — not just a good idea, it’s the law!
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On Tue, 22 Nov 2016 12:45 pm, BartC wrote:

> On 21/11/2016 14:50, Steve D'Aprano wrote:
>> On Mon, 21 Nov 2016 11:09 pm, BartC wrote:
>
>> Modern machines run multi-tasking operating systems, where there can
>> be other processes running. Depending on what you use as your timer,
>> you may be measuring the time that those other processes run. The OS
>> can cache frequently used pieces of code, which allows it to run
>> faster. The CPU itself will cache some code.
>
> You get to know after a while what kinds of processes affect timings.
> For example, streaming a movie at the same time.

Really, no. You'll just have to take my word on this, but I'm not
streaming any movies at the moment. I don't even have a web browser
running. And since I'm running Linux, I don't have an anti-virus scanner
that might have just triggered a scan. (But since I'm running Linux, I do
have a web server, mail server, a DNS server, cron, and about 300 other
processes running, any of which might start running for a microsecond or
ten in the middle of a job.)

py> with Stopwatch():
...     x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.007164 seconds

And again:

py> with Stopwatch():
...     x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.14 seconds

Look at the variation in the timing: 0.007164 versus 0.14 seconds. That's
the influence of a cache, or more than one cache, somewhere. But if I run
it again:

py> with Stopwatch():
...     x = math.sin(1.234)
...
elapsed time is very small; consider using the timeit module for
micro-timings of small code snippets
time taken: 0.13 seconds

there's a smaller variation, this time "only" 7%, for code which hasn't
changed. That's what you're up against. Running a whole lot of loops can,
sometimes, mitigate some of that variation, but not always. Even when
running in a loop, you can easily get variation of 10% or more just at
random.

> So when you need to compare timings, you turn those off.
>
>> The shorter the code snippet, the more these complications are
>> relevant. In this particular case, we can be reasonably sure that the
>> time it takes to create a list range(100000) and the overhead of the
>> loop is *probably* quite a small percentage of the time it takes to
>> perform 100000 vector multiplications. But that's not a safe
>> assumption for all code snippets.
>
> Yes, it was one of those crazy things that Python used to have to do,
> creating a list of N numbers just in order to be able to count to N.

Doesn't matter. Even with xrange, you're still counting the cost of
looking up xrange, passing one or more arguments to it, parsing those
arguments, creating an xrange object, and iterating over that xrange
object repeatedly. None of those things are free. You might *hope* that
the cost of those things is insignificant compared to what you're
actually interested in timing, but you don't know.

And you're resisting the idea of using a tool that is specifically
designed to avoid measuring all that overhead. It's okay that your
intuitions about the cost of executing Python code are inaccurate. What's
not okay is your refusal to listen to those who have a better idea of
what's involved.

[...]

>> The timeit module automates a bunch of tricky-to-get-right best
>> practices for timing code. Is that a problem?
>
> The problem is it substitutes a bunch of tricky-to-get-right options
> and syntax which has to be typed /at the command line/. And you really
> don't want to have to write code at the command line (especially if
> sourced from elsewhere, which means you have to transcribe it).

You have to transcribe it no matter what you do. Unless you are given
correctly written timing code.

You don't have to use timeit from the command line. But you're mad if you
don't: the smaller the code snippet, the more convenient it is.

[steve@ando ~]$ python2.7 -m timeit -s "x = 257" "3*x"
10000000 loops, best of 3: 0.106 usec per loop
[steve@ando ~]$ python3.5 -m timeit -s "x = 257" "3*x"
10000000 loops, best of 3: 0.137 usec per loop

That's *brilliant* and much simpler than anything you are doing with
loops and clocks and whatnot. It's simple, straightforward, and tells me
exactly what I expected to see. (Python 3.6 will be even better.)

For the record, the reason Python 3.5 is so much slower here is because
it is a debugging build.

>> But if you prefer doing it "old school" from within Python, then:
>>
>> from timeit import Timer
>> t = Timer('np.cross(x, y)', setup="""
>> import numpy as np
>> x = np.array([1, 2, 3])
>> y = np.array([4, 5, 6])
>> """)
>>
>> # take five measurements of 10 calls each, and report the fastest
>> result = min(t.repeat(number=10, repeat=5))/10
>> print(result)  # time in seconds per call
>>
>> Better?
>
> A bit, but the code is now inside a string! As oppo
Re: Numpy slow at vector cross product?
On Tue, 22 Nov 2016 05:43 am, BartC wrote:

> The fastest I can get compiled, native code to do this is at 250
> million cross-products per second.

Yes, yes, you're awfully clever, and your secret private language is so
much more efficient than even C that the entire IT industry ought to hang
their head in shame.

I'm only being *half* sarcastic here, for what it's worth. I remember the
days when I could fit an entire operating system, plus applications, on a
400K floppy disk, and they would run at acceptable speed on something
like an 8 MHz CPU. Code used to be more efficient, with less overhead.

But given that your magic compiler runs only on one person's PC in the
entire world, it is completely irrelevant.

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On 21/11/2016 14:50, Steve D'Aprano wrote:
> On Mon, 21 Nov 2016 11:09 pm, BartC wrote:
>
> Modern machines run multi-tasking operating systems, where there can be
> other processes running. Depending on what you use as your timer, you
> may be measuring the time that those other processes run. The OS can
> cache frequently used pieces of code, which allows it to run faster.
> The CPU itself will cache some code.

You get to know after a while what kinds of processes affect timings. For
example, streaming a movie at the same time. So when you need to compare
timings, you turn those off.

> The shorter the code snippet, the more these complications are
> relevant. In this particular case, we can be reasonably sure that the
> time it takes to create a list range(100000) and the overhead of the
> loop is *probably* quite a small percentage of the time it takes to
> perform 100000 vector multiplications. But that's not a safe assumption
> for all code snippets.

Yes, it was one of those crazy things that Python used to have to do,
creating a list of N numbers just in order to be able to count to N. But
that's not significant here. Either experience, or a preliminary test
with an empty loop, or using xrange, or using Py3, will show that the
loop overheads for N iterations in this case are small in comparison to
executing the bodies of the loops.

> This is why the timeit module exists: to do the right thing when it
> matters, so that you don't have to think about whether or not it
> matters. The timeit module works really really hard to get good
> quality, accurate timings, minimizing any potential overhead. The
> timeit module automates a bunch of tricky-to-get-right best practices
> for timing code. Is that a problem?

The problem is it substitutes a bunch of tricky-to-get-right options and
syntax which has to be typed /at the command line/. And you really don't
want to have to write code at the command line (especially if sourced
from elsewhere, which means you have to transcribe it).

> But if you prefer doing it "old school" from within Python, then:
>
> from timeit import Timer
> t = Timer('np.cross(x, y)', setup="""
> import numpy as np
> x = np.array([1, 2, 3])
> y = np.array([4, 5, 6])
> """)
>
> # take five measurements of 10 calls each, and report the fastest
> result = min(t.repeat(number=10, repeat=5))/10
> print(result)  # time in seconds per call
>
> Better?

A bit, but the code is now inside a string!

Code will normally exist as a proper part of a module, not on the command
line, in a command history, or in a string, so why not test it running
inside a module? But I've done a lot of benchmarking, and actually
measuring execution time is just part of it. This test I ran from inside
a function for example, not at module level, as that is more typical. Are
the variables inside a timeit string globals or locals? It's just a lot
of extra factors to worry about, and extra things to get wrong.

The loop timings used by the OP showed one took considerably longer than
the other. And that was confirmed by others. There's nothing wrong with
that method.

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list
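A partial answer to Bart's question, as I understand timeit's template:
the setup and statement strings are compiled into the body of a single
generated function, so names bound in setup behave like locals of that
function, not your module's globals. On Python 3.5+ the timeit functions
also accept a globals argument, so the code under test can refer to an
ordinary module namespace instead of living in a string; a small sketch:

import timeit
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Python 3.5+: no setup string; the statement sees this module's names
t = timeit.timeit("np.cross(x, y)", globals=globals(), number=10000)
print(t / 10000)   # seconds per call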
Re: Numpy slow at vector cross product?
On 21/11/2016 17:04, Nobody wrote:
> On Mon, 21 Nov 2016 14:53:35 +0000, BartC wrote:
>> Also that the critical bits were not implemented in Python?
>
> That is correct. You'll notice that there aren't any loops in
> numpy.cross. It's just a wrapper around a bunch of vectorised
> operations (*, -, []). If you aren't taking advantage of vectorisation,
> there's no reason to expect numpy to be any faster than primitive
> operations, any more than you'd expect
> (numpy.array([1]) + numpy.array([2]))[0] to be faster than "1+2".
>
> Beyond that, you'd expect a generic function to be at a disadvantage
> compared to a function which makes assumptions about its arguments.
> Given what it does, I wouldn't expect numpy.cross() to be faster for
> individual vectors if it was written in C.

The fastest I can get compiled, native code to do this is at 250 million
cross-products per second. The fastest using pure Python executed with
Python 2.7 is 0.5 million per second. With pypy, around 8 million per
second. (Results will vary by machine, version, and OS, so this is just
one set of timings.) So numpy, at 0.03 million per second [on a different
OS and different version], has room for improvement I think!

(In all cases, the loop has been hobbled so that one component increments
per loop, and one component of the result is summed and then displayed at
the end. This is to stop gcc, and partly pypy, from optimising the code
out of existence; usually you are not calculating the same vector product
repeatedly. Without the restraint, pypy leaps to 100 million per second,
and gcc to an infinite number.)

The tests were with values assumed to be vectors, assumed to have 3
components, and without any messing about with axes, whatever that code
does. It's just a pure, streamlined vector cross product (just as I use
in my own code). Such a streamlined version can also be written in Python
(although it would be better with a dedicated 3-component vector type
rather than a general-purpose list or even numpy array).

It's still a puzzle why directly executing the code that numpy uses was
still faster than numpy itself, when both were run with CPython. Unless
numpy is perhaps using extra wrappers around numpy.cross.

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list
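Bart's exact harness isn't shown in the thread, but the hobbling he
describes looks something like this sketch in Python: vary one input
component each iteration and accumulate one output component, so an
optimiser can't hoist or delete the work (the loop count is illustrative):

from timeit import default_timer

def cross3(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

N = 1000000
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
total = 0.0
start = default_timer()
for i in range(N):
    x[0] += 1.0                  # vary an input each iteration
    total += cross3(x, y)[2]     # consume one component of each result
elapsed = default_timer() - start
print("%.2f million products/sec (checksum %g)" % (N / elapsed / 1e6, total))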
Re: Numpy slow at vector cross product?
On Mon, 21 Nov 2016 14:53:35 +0000, BartC wrote:

> Also that the critical bits were not implemented in Python?

That is correct. You'll notice that there aren't any loops in numpy.cross.
It's just a wrapper around a bunch of vectorised operations (*, -, []).

If you aren't taking advantage of vectorisation, there's no reason to
expect numpy to be any faster than primitive operations, any more than
you'd expect (numpy.array([1]) + numpy.array([2]))[0] to be faster
than "1+2".

Beyond that, you'd expect a generic function to be at a disadvantage
compared to a function which makes assumptions about its arguments. Given
what it does, I wouldn't expect numpy.cross() to be faster for individual
vectors if it was written in C.
--
https://mail.python.org/mailman/listinfo/python-list
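The per-element overhead Nobody describes is easy to measure; a rough
sketch using the timeit module (the globals parameter needs Python 3.5+,
and absolute numbers will vary by machine):

from timeit import timeit
import numpy as np

ints = {"x": 1, "y": 2}                          # use names, so CPython
arrs = {"a": np.array([1]), "b": np.array([2])}  # can't constant-fold 1+2

print(timeit("x + y", globals=ints, number=1000000))      # plain ints
print(timeit("(a + b)[0]", globals=arrs, number=1000000)) # 1-element arrays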
Re: Numpy slow at vector cross product?
Perhaps your implementation isn't as general as numpy's? I pulled out the
TestCross class from numpy.core.tests.test_numeric and replaced calls to
np.cross with calls to your function. I got an error in
test_broadcasting_shapes:

ValueError: operands could not be broadcast together with shapes (1,2) (5,)

Skip
--
https://mail.python.org/mailman/listinfo/python-list
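For anyone wanting to repeat Skip's experiment, a rough sketch. It
assumes a numpy of that era, where TestCross is a unittest-style class in
numpy.core.tests.test_numeric, and that mycross is the candidate function
posted elsewhere in the thread; depending on how the test module imports
cross, you may need to patch the test module's own reference instead of
np.cross:

import unittest
import numpy as np
import numpy.core.tests.test_numeric as tn

np.cross = mycross   # hypothetical: swap in the function under test

suite = unittest.TestLoader().loadTestsFromTestCase(tn.TestCross)
unittest.TextTestRunner(verbosity=2).run(suite)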
Re: Numpy slow at vector cross product?
On 21/11/2016 12:44, Peter Otten wrote:
> After a look into the source this is no longer a big surprise (numpy
> 1.8.2):
>
>     if axis is not None:
>         axisa, axisb, axisc = (axis,)*3
>     a = asarray(a).swapaxes(axisa, 0)
>     b = asarray(b).swapaxes(axisb, 0)
>
> The situation may be different when you process vectors in bulk, i. e.
> instead of [cross(a, bb) for bb in b] just say cross(a, b).

I thought that numpy was supposed to be fast? Also that the critical bits
were not implemented in Python?

Anyway I tried your code (put into a function as shown below) in the same
test program, and it was *still* 3 times as fast as numpy! (mycross() was
3 times as fast as np.cross().) Explain that...

---
def mycross(a, b, axisa=-1, axisb=-1, axisc=-1, axis=None):
    if axis is not None:
        axisa, axisb, axisc = (axis,)*3
    a = np.asarray(a).swapaxes(axisa, 0)
    b = np.asarray(b).swapaxes(axisb, 0)
    msg = "incompatible dimensions for cross product\n"\
          "(dimension must be 2 or 3)"
    if (a.shape[0] not in [2, 3]) or (b.shape[0] not in [2, 3]):
        raise ValueError(msg)
    if a.shape[0] == 2:
        if (b.shape[0] == 2):
            cp = a[0]*b[1] - a[1]*b[0]
            if cp.ndim == 0:
                return cp
            else:
                return cp.swapaxes(0, axisc)
        else:
            x = a[1]*b[2]
            y = -a[0]*b[2]
            z = a[0]*b[1] - a[1]*b[0]
    elif a.shape[0] == 3:
        if (b.shape[0] == 3):
            x = a[1]*b[2] - a[2]*b[1]
            y = a[2]*b[0] - a[0]*b[2]
            z = a[0]*b[1] - a[1]*b[0]
        else:
            x = -a[2]*b[1]
            y = a[2]*b[0]
            z = a[0]*b[1] - a[1]*b[0]
    cp = np.array([x, y, z])
    if cp.ndim == 1:
        return cp
    else:
        return cp.swapaxes(0, axisc)
---

Tested as:

x = np.array([1,2,3])
y = np.array([4,5,6])
start = time.clock()
for i in range(loops):
    z = mycross(x,y)
print "Calc, %s loops: %.2g seconds" % (loops, time.clock()-start)

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On Mon, 21 Nov 2016 11:09 pm, BartC wrote:

> On 21/11/2016 02:48, Steve D'Aprano wrote:
[...]
>> However, your code is not a great way of timing code. Timing code is
>> *very* difficult, and can be affected by many things, such as external
>> processes, CPU caches, even the function you use for getting the time.
>> Much of the time you are timing here will be in creating the
>> range(loops) list, especially if loops is big.
>
> But both loops are the same size. And that overhead can quickly be
> disposed of by measuring empty loops in both cases. (On my machine,
> about 0.006/7 seconds for loops of 100,000.)

No, you cannot make that assumption, not in general. On modern machines,
you cannot assume that the time it takes to execute foo() immediately
followed by bar() is the same as the time it takes to execute foo() and
bar() separately.

Modern machines run multi-tasking operating systems, where there can be
other processes running. Depending on what you use as your timer, you may
be measuring the time that those other processes run. The OS can cache
frequently used pieces of code, which allows it to run faster. The CPU
itself will cache some code.

The shorter the code snippet, the more these complications are relevant.
In this particular case, we can be reasonably sure that the time it takes
to create a list range(100000) and the overhead of the loop is *probably*
quite a small percentage of the time it takes to perform 100000 vector
multiplications. But that's not a safe assumption for all code snippets.

This is why the timeit module exists: to do the right thing when it
matters, so that you don't have to think about whether or not it matters.
The timeit module works really really hard to get good quality, accurate
timings, minimizing any potential overhead. The timeit module automates a
bunch of tricky-to-get-right best practices for timing code. Is that a
problem?

>> The best way to time small snippets of code is to use the timeit
>> module. Open a terminal or shell (*not* the Python interactive
>> interpreter, the operating system's shell: you should expect a $ or %
>> prompt) and run timeit from that. Copy and paste the following two
>> commands into your shell prompt:
>>
>> python2.7 -m timeit --repeat 5 -s "import numpy as np" \
>> -s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
>> -- "np.cross(x, y)"
[...]
> Yes, I can see that typing all the code out again, and remembering all
> those options and putting -s, -- and \ in all the right places, is a
> much better way of doing it! Not error prone at all.

Gosh Bart, how did you manage to write that sentence? How did you
remember all those words, and remember to put the punctuation marks in
the right places? You even used sarcasm! You must be a genius. (Oh look,
I can use it too.)

Seriously Bart? You've been a programmer for how many decades, and you
can't work out how to call a command from the shell?

This is about working effectively with your tools, and a basic
understanding of the shell is an essential tool for programmers. This was
a *simple* command. It was a LONG command, but don't be fooled by the
length, and the fact that it went over multiple lines: it was dirt
simple.

I'm not saying that every programmer needs to be a greybeard Unix guru
(heaven knows that I'm not!), but they ought to be able to run simple
commands from the command line. Those who don't are in the same position
as carpenters who don't know the differences between the various kinds of
hammer or saws. Sure, you can still do a lot of work using just one kind
of hammer and one kind of saw, but you'll work better, faster and smarter
with the right kind. You don't use a rip saw to make fine cuts, and you
don't use a mash hammer to drive tacks.

The -m option lets you run a module without knowing the precise location
of the source file. Some of the most useful commands to learn:

python -m unittest ...
python -m doctest ...
python -m timeit ...

The biggest advantage of calling the timeit module is that it will
automatically select the number of iterations you need to run to get good
timing results, without wasting time running excessive loops. (The timeit
module is *significantly* improved in Python 3.6, but even in older
versions it's pretty good.)

But if you prefer doing it "old school" from within Python, then:

from timeit import Timer
t = Timer('np.cross(x, y)', setup="""
import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
""")

# take five measurements of 10 calls each, and report the fastest
result = min(t.repeat(number=10, repeat=5))/10
print(result)  # time in seconds per call

Better?

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On Mon, Nov 21, 2016 at 1:38 AM, BartC wrote:
> On 20/11/2016 20:46, DFS wrote:
>>
>> import sys, time, numpy as np
>> loops=int(sys.argv[1])
>>
>> x=np.array([1,2,3])
>> y=np.array([4,5,6])
>> start=time.clock()

In Unix, time.clock doesn't measure wall-clock time, but rather an
approximation to the CPU time used by the current process. On the other
hand, time.time calls gettimeofday, if available, which has a resolution
of 1 microsecond. Python 2 timing tests should use time.time on Unix.

In Windows, time.time calls GetSystemTimeAsFileTime, which has a default
resolution of only about 15 ms, adjustable down to about 1 ms. On the
other hand, time.clock calls QueryPerformanceCounter, which has a
resolution of about 100 nanoseconds. Python 2 timing tests should use
time.clock on Windows.

In Python 3.3+, timing tests should use time.perf_counter. In Linux this
calls clock_gettime using a monotonic clock with a resolution of 1
nanosecond, and in Windows it calls QueryPerformanceCounter. In any case,
timeit.default_timer selects the best function to call for a given
platform.
--
https://mail.python.org/mailman/listinfo/python-list
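Following that advice, a sketch of the OP's timing loop with only the
clock changed to the portable one:

import sys
import numpy as np
from timeit import default_timer  # perf_counter on 3.3+, else the best 2.x clock

loops = int(sys.argv[1])
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

start = default_timer()
for i in range(loops):
    np.cross(x, y)
print("Numpy, %s loops: %.2g seconds" % (loops, default_timer() - start))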
Re: Numpy slow at vector cross product?
Steve D'Aprano wrote:

> On Mon, 21 Nov 2016 07:46 am, DFS wrote:
>
>> import sys, time, numpy as np
>> loops=int(sys.argv[1])
>>
>> x=np.array([1,2,3])
>> y=np.array([4,5,6])
>> start=time.clock()
>> for i in range(loops):
>>     np.cross(x,y)
>> print "Numpy, %s loops: %.2g seconds" %(loops,time.clock()-start)
>
> [...]
>
>> $ python vector_cross.py 100000
>> Numpy, 100000 loops: 2.5 seconds
>> Calc, 100000 loops: 0.13 seconds
>>
>> Did I do something wrong, or is numpy slow at this?
>
> I can confirm similar results.
>
> However, your code is not a great way of timing code. Timing code is
> *very* difficult, and can be affected by many things, such as external
> processes, CPU caches, even the function you use for getting the time.
> Much of the time you are timing here will be in creating the
> range(loops) list, especially if loops is big.
>
> The best way to time small snippets of code is to use the timeit
> module. Open a terminal or shell (*not* the Python interactive
> interpreter, the operating system's shell: you should expect a $ or %
> prompt) and run timeit from that. Copy and paste the following two
> commands into your shell prompt:
>
> python2.7 -m timeit --repeat 5 -s "import numpy as np" \
> -s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
> -- "np.cross(x, y)"
>
> python2.7 -m timeit --repeat 5 -s "x = [1, 2, 3]" \
> -s "y = [4, 5, 6]" -s "z = [0, 0, 0]" \
> -- "z[0] = x[1]*y[2] - x[2]*y[1]; z[1] = x[2]*y[0] - \
> x[0]*y[2]; z[2] = x[0]*y[1] - x[1]*y[0]"
>
> The results I get are:
>
> 10000 loops, best of 5: 30 usec per loop
>
> 1000000 loops, best of 5: 1.23 usec per loop
>
> So on my machine, np.cross() is about 25 times slower than multiplying
> by hand.

After a look into the source this is no longer a big surprise (numpy
1.8.2):

    if axis is not None:
        axisa, axisb, axisc = (axis,)*3
    a = asarray(a).swapaxes(axisa, 0)
    b = asarray(b).swapaxes(axisb, 0)
    msg = "incompatible dimensions for cross product\n"\
          "(dimension must be 2 or 3)"
    if (a.shape[0] not in [2, 3]) or (b.shape[0] not in [2, 3]):
        raise ValueError(msg)
    if a.shape[0] == 2:
        if (b.shape[0] == 2):
            cp = a[0]*b[1] - a[1]*b[0]
            if cp.ndim == 0:
                return cp
            else:
                return cp.swapaxes(0, axisc)
        else:
            x = a[1]*b[2]
            y = -a[0]*b[2]
            z = a[0]*b[1] - a[1]*b[0]
    elif a.shape[0] == 3:
        if (b.shape[0] == 3):
            x = a[1]*b[2] - a[2]*b[1]
            y = a[2]*b[0] - a[0]*b[2]
            z = a[0]*b[1] - a[1]*b[0]
        else:
            x = -a[2]*b[1]
            y = a[2]*b[0]
            z = a[0]*b[1] - a[1]*b[0]
    cp = array([x, y, z])
    if cp.ndim == 1:
        return cp
    else:
        return cp.swapaxes(0, axisc)

The situation may be different when you process vectors in bulk, i. e.
instead of [cross(a, bb) for bb in b] just say cross(a, b).
--
https://mail.python.org/mailman/listinfo/python-list
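The bulk form Peter mentions in his last sentence relies on broadcasting:
a single (3,) vector crossed against an (N, 3) array in one call. A quick
sketch (the array sizes are illustrative):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.random.rand(100000, 3)

slow = np.array([np.cross(a, bb) for bb in b])  # one Python-level call per row
fast = np.cross(a, b)                           # a single vectorised call
assert np.allclose(slow, fast)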
Re: Numpy slow at vector cross product?
On 21/11/2016 02:48, Steve D'Aprano wrote:
> On Mon, 21 Nov 2016 07:46 am, DFS wrote:
>> start=time.clock()
>> for i in range(loops):
>>     np.cross(x,y)
>> print "Numpy, %s loops: %.2g seconds" %(loops,time.clock()-start)
>
> However, your code is not a great way of timing code. Timing code is
> *very* difficult, and can be affected by many things, such as external
> processes, CPU caches, even the function you use for getting the time.
> Much of the time you are timing here will be in creating the
> range(loops) list, especially if loops is big.

But both loops are the same size. And that overhead can quickly be
disposed of by measuring empty loops in both cases. (On my machine, about
0.006/7 seconds for loops of 100,000.)

> The best way to time small snippets of code is to use the timeit
> module. Open a terminal or shell (*not* the Python interactive
> interpreter, the operating system's shell: you should expect a $ or %
> prompt) and run timeit from that. Copy and paste the following two
> commands into your shell prompt:
>
> python2.7 -m timeit --repeat 5 -s "import numpy as np" \
> -s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
> -- "np.cross(x, y)"
>
> python2.7 -m timeit --repeat 5 -s "x = [1, 2, 3]" \
> -s "y = [4, 5, 6]" -s "z = [0, 0, 0]" \
> -- "z[0] = x[1]*y[2] - x[2]*y[1]; z[1] = x[2]*y[0] - \
> x[0]*y[2]; z[2] = x[0]*y[1] - x[1]*y[0]"

Yes, I can see that typing all the code out again, and remembering all
those options and putting -s, -- and \ in all the right places, is a much
better way of doing it! Not error prone at all.

--
bartc
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On Mon, 21 Nov 2016 07:46 am, DFS wrote:

> import sys, time, numpy as np
> loops=int(sys.argv[1])
>
> x=np.array([1,2,3])
> y=np.array([4,5,6])
> start=time.clock()
> for i in range(loops):
>     np.cross(x,y)
> print "Numpy, %s loops: %.2g seconds" %(loops,time.clock()-start)

[...]

> $ python vector_cross.py 100000
> Numpy, 100000 loops: 2.5 seconds
> Calc, 100000 loops: 0.13 seconds
>
> Did I do something wrong, or is numpy slow at this?

I can confirm similar results.

However, your code is not a great way of timing code. Timing code is
*very* difficult, and can be affected by many things, such as external
processes, CPU caches, even the function you use for getting the time.
Much of the time you are timing here will be in creating the range(loops)
list, especially if loops is big.

The best way to time small snippets of code is to use the timeit module.
Open a terminal or shell (*not* the Python interactive interpreter, the
operating system's shell: you should expect a $ or % prompt) and run
timeit from that. Copy and paste the following two commands into your
shell prompt:

python2.7 -m timeit --repeat 5 -s "import numpy as np" \
-s "x = np.array([1, 2, 3])" -s "y = np.array([4, 5, 6])" \
-- "np.cross(x, y)"

python2.7 -m timeit --repeat 5 -s "x = [1, 2, 3]" \
-s "y = [4, 5, 6]" -s "z = [0, 0, 0]" \
-- "z[0] = x[1]*y[2] - x[2]*y[1]; z[1] = x[2]*y[0] - \
x[0]*y[2]; z[2] = x[0]*y[1] - x[1]*y[0]"

The results I get are:

10000 loops, best of 5: 30 usec per loop

1000000 loops, best of 5: 1.23 usec per loop

So on my machine, np.cross() is about 25 times slower than multiplying by
hand.

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Numpy slow at vector cross product?
On 20/11/2016 20:46, DFS wrote:
> import sys, time, numpy as np
> loops=int(sys.argv[1])
>
> x=np.array([1,2,3])
> y=np.array([4,5,6])
> start=time.clock()
> for i in range(loops):
>     np.cross(x,y)
> print "Numpy, %s loops: %.2g seconds" %(loops,time.clock()-start)
>
> x=[1,2,3]
> y=[4,5,6]
> z=[0,0,0]
> start=time.clock()
> for i in range(loops):
>     z[0]=x[1]*y[2]-x[2]*y[1]; z[1]=x[2]*y[0]-x[0]*y[2]; z[2]=x[0]*y[1]-x[1]*y[0];
> print "Calc, %s loops: %.2g seconds" %(loops,time.clock()-start)

I don't know why numpy is slow, but I can confirm similar results. In
fact, if the workings are put into a function, then the difference is
even more marked, with the normal calculation being 50% faster.

Maybe numpy has extra overheads, and the arrays being operated on are
very small, but even so, 30 times slower than CPython? (2.5 to 0.083
seconds.)

--
Bartc
--
https://mail.python.org/mailman/listinfo/python-list