Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Hi,

> Note that the plug-in idea is just my own idea, it is not something agreed by anyone else. So maybe it won't be done for numpy 1.1, or at all. It depends on the main maintainers of numpy.

I'm +3 for the plugin idea - it would have huge benefits for installation and automatic optimization. What needs to be done? Who could do it?

Matthew
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
David Cournapeau wrote:
> Gnata Xavier wrote:
>> Ok I will try to see what I can do, but it is sure that we do need the plug-in system first (read before the threads in the numpy release). During the devel of 1.1, I will try to find some time to understand where I should put some pragmas into the ufuncs using a very conservative approach. Any people with some OpenMP knowledge are welcome, because I'm not an OpenMP expert but only an OpenMP user in my C/C++ codes.
>
> Note that the plug-in idea is just my own idea, it is not something agreed by anyone else. So maybe it won't be done for numpy 1.1, or at all. It depends on the main maintainers of numpy.
>
>> and the results :
>>
>>   niter   size   no threads     threads
>>    1000     80    10.308471   30.007250
>>     100    160     1.902563    5.800172
>>      10    320     0.543008    1.123274
>>       1    640     0.206823    0.223031
>>    1000   1280     0.088898    0.044268
>>     100   2560     0.150429    0.008880
>>      10   5120     0.289589    0.002084
>>
>> ---
>>
>> On this machine, we should start to use threads *in this testcase* iff size >= 1280 (a 100*100 image is a very very small one :))
>
> Maybe openMP can be more clever, but it tends to show that openMP, when used naively, can *not* decide how many threads to use. That's really the core problem: again, I don't know much about openMP, but almost any project using multi-thread/multi-process and not being embarrassingly parallel has the problem that it makes things much slower for many cases where thread creation/management and co have a lot of overhead proportionally to the computation. The problem is to determine the N, dynamically, or in a way which works well for most cases. OpenMP was created for HPC, where you have very large data; it is not so obvious to me that it is adapted to numpy, which has to be much more flexible. Being fast on a given problem is easy; being fast on a whole range, that's another story: the problem really is to be as fast as before on small arrays.
>
> The fact that matlab, while having much more resources than us, took years to do it, makes me extremely skeptical on the efficient use of multi-threading without real benchmarks for numpy. They have a dedicated team, who developed a JIT for matlab, which inserts multi-threaded code on the fly (for m files, not when you are in the interpreter), and who uses multi-threaded blas/lapack (which is already available in numpy, depending on the blas/lapack you are using). But again, and that's really the only thing I have to say: prove me wrong :)
>
> David

I can't :) I can't for a simple reason. Quoting the IDL documentation:

"There are instances when allowing IDL to use its default thread pool settings can lead to undesired results. In some instances, a multithreaded implementation using the thread pool may actually take longer to complete a given job than a single-threaded implementation."

http://idlastro.gsfc.nasa.gov/idl_html_help/The_IDL_Thread_Pool.html

"To prevent the use of the thread pool for computations that involve too few data elements, IDL supports a minimum threshold value for thread pool computations. The minimum threshold value is contained in the TPOOL_MIN_ELTS field of the !CPU system variable. See the following sections for details on modifying this value."

At work, I can see people switching from IDL to numpy/scipy/pylab. They are very happy with numpy, but they would like to find this thread pool capability in numpy. All these guys come from C (or from fortran), often from C/fortran MPI or OpenMP. They know which parts of a code should be threaded and which parts should not. As a result, they are very happy with the IDL thread pool. I'm just thinking about how to translate that into numpy.
Now I have to have a close look at the ufuncs code and figure out how to add -fopenmp.

From a very pragmatic point of view, what is the best/simplest way to use inline C or whatever to do that: I have a large array A and, at some points of my nice numpy code, I would like to compute, let's say, the threaded sum or the sine of this array? Assuming that I know how to write it in C/OpenMP code. (The background is that I really know that in my case it is much faster... and I asked my boss for a multi-core machine ;)).

Cheers,
Xavier
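[For the pragmatic question, the kind of inline C/OpenMP helper Xavier describes might look like the following minimal sketch, assuming gcc with -fopenmp and -lm; the function names are illustrative only, not numpy API:]

#include <math.h>

/* threaded sum: each thread accumulates a private partial sum and
 * OpenMP combines them at the end */
double threaded_sum(const double *a, long n)
{
    double s = 0.0;
    long i;
#pragma omp parallel for reduction(+:s)
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* threaded elementwise sine: iterations are independent, so a plain
 * parallel for is safe */
void threaded_sine(const double *in, double *out, long n)
{
    long i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = sin(in[i]);
}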
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
A couple of thoughts on parallelism:

1. Can someone come up with a small set of cases and time them on numpy, IDL, Matlab, and C, using various parallel schemes, for each of a representative set of architectures? We're comparing a benchmark to itself on different architectures, rather than seeing whether the thread capability is helping our competition on the same architecture. If it's mostly not helping them, we can forget it for the time being. I suspect that it is, in fact, helping them, or at least not hurting them.

2. Would it slow things much to have some state that the routines check before deciding whether to run a parallel implementation or not? It could default to single thread except in the cases where parallelism always helps, but the user can configure it to multithread beyond certain thresholds of, say, number of elements. Then, in the short term, a savvy user can tweak that state to get parallelism for more than N elements. In the longer term, there could be a test routine that would run on install and configure the state for that particular machine. When numpy started, it would read the saved file and computation would be optimized for that machine. The user could always override it.

3. We should remember the first rule of parallel programming, which Anne quotes as "premature optimization is the root of all evil". There is a lot to fix in numpy that is more fundamental than speed. I am the first to want things fast (I would love my secondary eclipse analysis to run in less than a week), but we have gaping holes in documentation and other areas that one would expect to have been filled before a 1.0 release. I hope we can get them filled for 1.1. It bears repeating that our main resource shortage is in person-hours, and we'll get more of those as the community grows. Right now our deficit in documentation is hurting us badly, while our deficit in parallelism is not. There is no faster way of growing the community than making it trivial to learn how to use numpy without hand-holding from an experienced user. Let's explore parallelism to assess when and how it might be right to do it, but let's stay focused on the fundamentals until we have those nailed.

--jh--
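[To make point 2 concrete, a sketch in C of the kind of tunable state a routine could consult before going parallel, in the spirit of IDL's TPOOL_MIN_ELTS; the names and the default value are invented for illustration, and nothing like this exists in numpy today:]

/* invented name and default; the saved-state file described above would
 * set this at startup, and the user could override it */
static long thread_min_elts = 100000;

void set_thread_min_elts(long n)
{
    thread_min_elts = n;
}

/* elementwise add that consults the state before going parallel
 * (compile with gcc -fopenmp) */
void add_doubles(const double *a, const double *b, double *out, long n)
{
    long i;
    if (n < thread_min_elts) {
        /* small array: thread start-up costs more than it saves */
        for (i = 0; i < n; i++)
            out[i] = a[i] + b[i];
        return;
    }
#pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}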
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
> A couple of thoughts on parallelism:
>
> 1. Can someone come up with a small set of cases and time them on numpy, IDL, Matlab, and C, using various parallel schemes, for each of a representative set of architectures? We're comparing a benchmark to itself on different architectures, rather than seeing whether the thread capability is helping our competition on the same architecture. If it's mostly not helping them, we can forget it for the time being. I suspect that it is, in fact, helping them, or at least not hurting them.

Well, I could ask some IDL users to provide you with benchmarks. In C/OpenMP, I have posted a trivial code.

> 2. Would it slow things much to have some state that the routines check before deciding whether to run a parallel implementation or not? It could default to single thread except in the cases where parallelism always helps, but the user can configure it to multithread beyond certain thresholds of, say, number of elements. Then, in the short term, a savvy user can tweak that state to get parallelism for more than N elements. In the longer term, there could be a test routine that would run on install and configure the state for that particular machine. When numpy started, it would read the saved file and computation would be optimized for that machine. The user could always override it.

No, it wouldn't cost that much, and that is exactly the way IDL (for instance) works.

> 3. We should remember the first rule of parallel programming, which Anne quotes as "premature optimization is the root of all evil". There is a lot to fix in numpy that is more fundamental than speed. I am the first to want things fast (I would love my secondary eclipse analysis to run in less than a week), but we have gaping holes in documentation and other areas that one would expect to have been filled before a 1.0 release. I hope we can get them filled for 1.1. It bears repeating that our main resource shortage is in person-hours, and we'll get more of those as the community grows. Right now our deficit in documentation is hurting us badly, while our deficit in parallelism is not. There is no faster way of growing the community than making it trivial to learn how to use numpy without hand-holding from an experienced user. Let's explore parallelism to assess when and how it might be right to do it, but let's stay focused on the fundamentals until we have those nailed.

That is well put and clear. It is also clear that our deficit in parallelism is not hurting us that badly. It is a real problem in some communities, like astronomers and image processing people, but the lack of documentation is the first one, that is true.

Xavier
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
> It is a real problem in some communities, like astronomers and image processing people, but the lack of documentation is the first one, that is true.

Even in those communities, I think that a lot could be done at a higher level, like what IPython1 does (task parallelism).

Matthieu

--
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 4:25 PM, Charles R Harris [EMAIL PROTECTED] wrote:
> On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED] wrote:
>> On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris [EMAIL PROTECTED] wrote:
>>> Maybe it's time to revisit the template subsystem I pulled out of Django.
>>
>> I am still -lots on using the Django template system. Please, please, please, look at Jinja or another templating package that could be dropped in without *any* modification.
>
> Well, I have a script that pulls the relevant parts out of Django. I know you had a bad experience, but... That said, Jinja looks interesting. It uses the Django syntax, which was one of the things I liked most about Django templates. In fact, it looks pretty much like Django templates made into a standalone application, which is what I was after. However, it's big, the installed egg is about 1 MiB, which is roughly 12x the size of my cut-down version of Django, and it has some C code, so it would need building.

The C code is optional.

> On the other hand, it also looks like it contains a lot of extraneous stuff, like translations, that could be removed. Would you be averse to adding it in if it looks useful?

I would still *really* prefer that you use a single-file templating module at the expense of template aesthetics and even features. I am still unconvinced that we need more features. You haven't shown me any concrete examples. If we do need the features of a larger package that we have to cut down, I'd prefer a package that we can cut down by simply removing files, not one that requires the modification of files.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Matthieu Brucher wrote:
>> It is a real problem in some communities, like astronomers and image processing people, but the lack of documentation is the first one, that is true.
>
> Even in those communities, I think that a lot could be done at a higher level, like what IPython1 does (task parallelism).
>
> Matthieu

Well, it is not that easy. We have several numpy codes that go like this:

1) open a large data file to get a numpy array
2) perform computations on this array (I'm only talking of the numpy part here; scipy is something else)
3) write the result to another large file

It is so simple to write using numpy :) Now, if I want to have several executables, step 3 is often a problem. The only simple way to speed this up is to split step 2 into threads (assuming that there is no other possible optimisation like sse, which is false, but out of the scope of numpy users). Using C, we can do that using OpenMP pragmas. It may not be optimal, but its ratio speedup/time_to_code is very large :) Now, we are switching from C to numpy because we cannot spend that much time playing with gdb/pointers just to open an image anymore.

Xavier

ps : I have seen your blog and you can send me an email off line about this topic and what you are doing :)
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Mon, Mar 24, 2008 at 10:35 AM, Robert Kern [EMAIL PROTECTED] wrote:
> On Sat, Mar 22, 2008 at 4:25 PM, Charles R Harris [EMAIL PROTECTED] wrote:
>> Well, I have a script that pulls the relevant parts out of Django. I know you had a bad experience, but... That said, Jinja looks interesting. It uses the Django syntax, which was one of the things I liked most about Django templates. In fact, it looks pretty much like Django templates made into a standalone application, which is what I was after. However, it's big, the installed egg is about 1 MiB, which is roughly 12x the size of my cut-down version of Django, and it has some C code, so it would need building.
>
> The C code is optional.
>
>> On the other hand, it also looks like it contains a lot of extraneous stuff, like translations, that could be removed. Would you be averse to adding it in if it looks useful?
>
> I would still *really* prefer that you use a single-file templating module at the expense of template aesthetics and even features. I am still unconvinced that we need more features. You haven't shown me any concrete examples. If we do need the features of a larger package that we have to cut down, I'd prefer a package that we can cut down by simply removing files, not one that requires the modification of files.

If you simply pull out the template subsystem, it is about 1 MiB, which is why Jinja looks like Django to me. If you remove extraneous files from Django, and probably Jinja, it comes down to about 400K. If you go on to remove extraneous capabilities, it drops down to 100K. It could all be made into a single file. In fact, I had a lot of Django rewritten with that in mind. Well, that and the fact I can't leave code untouched. What I liked about the idea, besides the syntax with if statements, nested for loops, includes, and a few filters, is that it allowed moving some of the common code out of the source and into higher level build files where variable values and flags could be set, making the whole build process more transparent and adaptable. That said, it is hard to beat the compactness of the current version.

Chuck
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Matthew Brett wrote:
> I'm +3 for the plugin idea - it would have huge benefits for installation and automatic optimization. What needs to be done? Who could do it?

The main issues are portability and reliability, I think. All OSes supported by numpy have more or less dynamic library loading support (that's how python itself works, after all, unless you compile everything statically), but the devil is in the details. In particular:

- I am not sure whether plugin unloading is supported by all OSes (the Mac OS X posix api does not enable unloading, for example; you have to use a specific API which I do not know anything about). Maybe we do not need it, I don't know; unloading sounds really difficult to support reliably anyway (how do we make sure numpy won't use the functions of the plugins?)
- build issues: it is really the same thing as ctypes at the build level, I think
- an api so that it can be used throughout numpy. There should be a clear interface for the plugins, which does not sound trivial either. That's the most difficult part in my mind (well, maybe not difficult, but time-consuming at least).

That's one of the reasons why I was thinking about a gradual move of most core functionalities toward a separate C library, with a simple and crystal clear interface, without any reference to any python API, just plain C with plain pointers. We could then force this core, pure C library to be used only through dereferencing of an additional pointer, thus enabling dynamic change of the actual functions (at least when numpy is started). I have to say I really like the idea of a more explicit separation between the actual computation code and the python wrapping; it can only help if we decide to write some of the wrappers in Cython/ctypes/whatever instead of pure C as today. It has many advantages in terms of reliability, maintainability and performance (testing performance would be much easier, I think, since it could be done in pure C).

cheers,

David
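[As a rough illustration of the dispatch-through-a-pointer idea, not an agreed design, the core could route calls through function pointers that a loader repoints at startup. All names, including the plugin symbol, are hypothetical; link with -ldl on Linux:]

#include <stdio.h>
#include <dlfcn.h>

typedef void (*add_func)(const double *, const double *, double *, long);

static void add_generic(const double *a, const double *b, double *o, long n)
{
    long i;
    for (i = 0; i < n; i++)
        o[i] = a[i] + b[i];
}

/* every caller goes through this pointer, so an optimized version can
 * be swapped in when numpy starts */
static add_func add_impl = add_generic;

void load_optimized_plugin(const char *path)
{
    void *handle;
    add_func f;

    handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "plugin not loaded, keeping generic code: %s\n",
                dlerror());
        return;
    }
    /* hypothetical symbol exported by an SSE2 build of the core */
    f = (add_func)dlsym(handle, "add_doubles_sse2");
    if (f != NULL)
        add_impl = f;
    /* the handle is never dlclose()d, which sidesteps the unloading
     * problem mentioned above */
}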
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Mon, Mar 24, 2008 at 12:12 PM, Gnata Xavier [EMAIL PROTECTED] wrote:
> Well, it is not that easy. We have several numpy codes that go like this:
> 1) open a large data file to get a numpy array
> 2) perform computations on this array (I'm only talking of the numpy part here; scipy is something else)
> 3) write the result to another large file
> It is so simple to write using numpy :) Now, if I want to have several executables, step 3 is often a problem.

If that large file can be accessed by memory-mapping, then step 3 can actually be quite easy. You have one program make the empty file of the given size (f.seek(FILE_SIZE); f.write('\0'); f.seek(0,0)) and then make each of the parallel programs memory-map the file and only write to their respective portions.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
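[The same pattern in C, for the multi-process case Xavier describes, might look like this minimal sketch using POSIX mmap. The parent is assumed to have created the file at its full size first; each worker writes a disjoint slice, so no locking is needed. Names are illustrative and most error handling is omitted:]

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int write_my_slice(const char *path, long nelems, int rank, int nworkers,
                   const double *results)
{
    int fd;
    double *out;
    long chunk, lo, hi, i;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    out = mmap(NULL, nelems * sizeof(double),
               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (out == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* this worker owns one disjoint chunk of the output */
    chunk = nelems / nworkers;
    lo = (long)rank * chunk;
    hi = (rank == nworkers - 1) ? nelems : lo + chunk;
    for (i = lo; i < hi; i++)
        out[i] = results[i - lo];

    munmap(out, nelems * sizeof(double));
    close(fd);
    return 0;
}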
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Robert Kern wrote:
> On Mon, Mar 24, 2008 at 12:12 PM, Gnata Xavier [EMAIL PROTECTED] wrote:
>> Well, it is not that easy. We have several numpy codes that go like this:
>> 1) open a large data file to get a numpy array
>> 2) perform computations on this array (I'm only talking of the numpy part here; scipy is something else)
>> 3) write the result to another large file
>> It is so simple to write using numpy :) Now, if I want to have several executables, step 3 is often a problem.
>
> If that large file can be accessed by memory-mapping, then step 3 can actually be quite easy. You have one program make the empty file of the given size (f.seek(FILE_SIZE); f.write('\0'); f.seek(0,0)) and then make each of the parallel programs memory-map the file and only write to their respective portions.

Yep, but that is the best case. Our standard case is a quite long sequence of simple computations on arrays. Some parts are clearly thread candidates, but not all of them. For instance, at step N+1 I have to multiply foo by the sum of a large array computed at step N-1. I can split the sum computation over several executables, but it is not convenient at all and not that easy to get the sum at the end (I know ugly ways to do that. Ugly.) One-step large computations can be split into several executables. Multi-step large ones are another story :(

Xavier
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 10:59 PM, David Cournapeau [EMAIL PROTECTED] wrote:
> Charles R Harris wrote:
>> It looks like memory access is the bottleneck, otherwise running 4 floats through in parallel should go a lot faster. I need to modify the program a bit and see how it works for doubles.
>
> I am not sure the benchmark is really meaningful: it does not use aligned buffers (16-byte alignment), and because of that, does not give a good idea of what can be expected from SSE. It shows why it is not so easy to get good performances, and why just throwing a few optimized loops won't work, though. Using sse/sse2 from unaligned buffers is a waste of time. Without this alignment, you need to take it into account (using _mm_loadu_ps vs _mm_load_ps), and that's extremely slow, basically killing most of the speed increase you can expect from using sse.

Yep, but I expect the compilers to take care of alignment, say by inserting a few single ops when needed. So I would just as soon leave vectorization to the compilers and wait until they get that good. The only thing needed then is contiguous data.

Chuck
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Charles R Harris wrote:
> Yep, but I expect the compilers to take care of alignment, say by inserting a few single ops when needed.

The other solution would be to have aligned allocators (it won't solve all cases, of course), because the compilers will never be able to take care of the cases where you call external libraries (fftw, where we could have between 50% and more than 100% speed increase if we had aligned data buffers by default).

> So I would just as soon leave vectorization to the compilers and wait until they get that good. The only thing needed then is contiguous data.

For non-contiguous data, things will be extremely slow anyway, so I don't think those need a lot of attention. If you care about performance, you should not use non-contiguous data.

cheers,

David
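[A sketch of what such an aligned allocator could look like on POSIX systems, using posix_memalign to guarantee the 16-byte alignment that SSE's fast aligned loads (and libraries like fftw) want; the wrapper name is invented:]

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* returns a 16-byte-aligned buffer (or NULL on failure); release it
 * with the ordinary free() */
void *aligned_malloc(size_t nbytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, nbytes) != 0)
        return NULL;
    return p;
}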
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
James Philbin wrote:
> OK, i've written a simple benchmark which implements an elementwise multiply (A=B*C) in three different ways (standard C, intrinsics, hand coded assembly). On the face of things the results seem to indicate that the vectorization works best on medium sized inputs. If people could post the results of running the benchmark on their machines (takes ~1min) along with the output of gcc --version and their chip model, that wd be v useful. It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench

CPU: Intel(R) Core(TM)2 CPU T7400 @ 2.16GHz (macbook, intel core 2 duo)
gcc (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2) (ubuntu gutsy gibbon 7.10)

$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100     0.0003ms (100.0%)   0.0002ms ( 68.3%)   0.0002ms ( 75.6%)
        1000     0.0023ms (100.0%)   0.0018ms ( 76.7%)   0.0020ms ( 87.1%)
       10000     0.0361ms (100.0%)   0.0193ms ( 53.4%)   0.0338ms ( 93.7%)
      100000     0.2839ms (100.0%)   0.1351ms ( 47.6%)   0.0937ms ( 33.0%)
     1000000     4.2108ms (100.0%)   4.1234ms ( 97.9%)   4.0886ms ( 97.1%)
    10000000    45.3192ms (100.0%)  45.5359ms (100.5%)  45.3466ms (100.1%)

Note that there is some variance in the results. Here is a second run to give an idea (look at Inline, size=10000):

$ ./vec_bench
Testing methods...
All OK

Problem size              Simple              Intrin              Inline
         100     0.0003ms (100.0%)   0.0002ms ( 69.5%)   0.0002ms ( 74.1%)
        1000     0.0024ms (100.0%)   0.0018ms ( 75.9%)   0.0020ms ( 86.4%)
       10000     0.0324ms (100.0%)   0.0186ms ( 57.3%)   0.0226ms ( 69.6%)
      100000     0.2840ms (100.0%)   0.1171ms ( 41.2%)   0.0939ms ( 33.1%)
     1000000     4.4034ms (100.0%)   4.3657ms ( 99.1%)   4.0465ms ( 91.9%)
    10000000    44.4854ms (100.0%)  43.9502ms ( 98.8%)  43.6824ms ( 98.2%)

HTH,
Emanuele
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Wow, a much more varied set of results than I was expecting. Could someone who has gcc 4.3 installed compile it with:

gcc -msse -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 -S vec_bench.c -o vec_bench.s

and attach vec_bench.s and the verbose output from gcc.

James
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Gnata Xavier wrote:
> Hi, I have a very limited knowledge of openmp but please consider this testcase:

Honestly, if it were that simple, it would already have been done a long time ago. The problem is that your test case is not even remotely close to how things have to be done in numpy.

cheers,

David
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sunday 23 March 2008, Charles R Harris wrote:
> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
>
> Problem size              Simple              Intrin              Inline
>          100     0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
>         1000     0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
>        10000     0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
>       100000     0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
>      1000000     2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
>     10000000    28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)

I'm mystified about your machine requiring just 28 ms for completing the 10 million test, when most of the other, similar processors in this thread (some faster than yours) fall pretty far from your figure. What sort of memory subsystem are you using?

> It looks like memory access is the bottleneck, otherwise running 4 floats through in parallel should go a lot faster.

Yes, that's probably right. This test is mainly measuring the memory access speed of machines for large datasets. For small ones, my guess is that the data is directly placed in caches, so there is no need to transport it to the CPU prior to doing the calculations. However, I'm not sure whether this kind of optimization for small datasets would be very useful in practice (read: general NumPy calculations), but I'm rather sceptical about this.

Cheers,

--
0,0     Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
"-"
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sunday 23 March 2008, David Cournapeau wrote:
> Gnata Xavier wrote:
>> Hi, I have a very limited knowledge of openmp but please consider this testcase:
>
> Honestly, if it were that simple, it would already have been done a long time ago. The problem is that your test case is not even remotely close to how things have to be done in numpy.

Why not? IMHO, complex operations requiring a great deal of operations per word, like trigonometric, exponential, etc., are the best candidates to take advantage of several cores or even SSE instructions (not sure whether SSE supports this sort of operations, though). At any rate, this is exactly the kind of parallel optimization that makes sense in Numexpr, in the sense that you could obtain decent speedups with multicore processors.

Cheers,

--
0,0     Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
"-"
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Francesc Altet wrote:
> Why not? IMHO, complex operations requiring a great deal of operations per word, like trigonometric, exponential, etc., are the best candidates to take advantage of several cores or even SSE instructions (not sure whether SSE supports this sort of operations, though).

I was talking about the general "using openmp in numpy" thing. If it were just adding one line at one place in the source code, someone would already have done it, no? But there are build issues, for example: you have to add support for openmp at compile and link time, and you have to make sure it works with compilers which do not support it.

Even without taking into account the build issues, there is the problem of correctly annotating the source code depending on the context. For example, many interesting places to use openmp in numpy would need more than just the parallel for pragma. From what I know of OpenMP, the annotations may depend on the kind of operation you are doing (independent element-wise operations or not). Also, the test case posted before uses a really big N, where you are sure that using multiple threads is efficient. What happens if N is small? Basically, the posted test is the best situation that can happen (big N, known operation with known context, etc.). That's a proof that openMP works, not that it can work for numpy.

I find the example of sse rather enlightening: in theory, you should expect a 100-300% speed increase using sse, but even with pure C code in a controlled manner, on one platform (linux + gcc), with varying, recent CPUs, the results are fundamentally different. So what would happen in numpy, where you don't control things that much?

cheers,

David
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
> I find the example of sse rather enlightening: in theory, you should expect a 100-300% speed increase using sse, but even with pure C code in a controlled manner, on one platform (linux + gcc), with varying, recent CPUs, the results are fundamentally different. So what would happen in numpy, where you don't control things that much?

This means that what we measure is not what we think we measure. The time we get is not only dependent on the number of instructions. Did someone make a complete instrumented profile of the code that everyone is testing, with callgrind or the Visual Studio profiler? This will tell us exactly what is happening:

- instructions
- cache issues (that is likely to be the bottleneck, but without a proof, nothing should be done about it)
- SSE efficiency
- ...

I think that to be really efficient, one would have to use a dynamic prefetcher, but these things are not available on x86, and even if they were, they would never make it to the general public because they can't be proof tested (binary modifications on the fly). But they are really efficient when going through an array.

Matthieu

--
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
David Cournapeau wrote:
> Francesc Altet wrote:
>> Why not? IMHO, complex operations requiring a great deal of operations per word, like trigonometric, exponential, etc., are the best candidates to take advantage of several cores or even SSE instructions (not sure whether SSE supports this sort of operations, though).
>
> I was talking about the general "using openmp in numpy" thing. If it were just adding one line at one place in the source code, someone would already have done it, no? But there are build issues, for example: you have to add support for openmp at compile and link time, and you have to make sure it works with compilers which do not support it.
>
> Even without taking into account the build issues, there is the problem of correctly annotating the source code depending on the context. For example, many interesting places to use openmp in numpy would need more than just the parallel for pragma. From what I know of OpenMP, the annotations may depend on the kind of operation you are doing (independent element-wise operations or not). Also, the test case posted before uses a really big N, where you are sure that using multiple threads is efficient. What happens if N is small? Basically, the posted test is the best situation that can happen (big N, known operation with known context, etc.). That's a proof that openMP works, not that it can work for numpy.
>
> I find the example of sse rather enlightening: in theory, you should expect a 100-300% speed increase using sse, but even with pure C code in a controlled manner, on one platform (linux + gcc), with varying, recent CPUs, the results are fundamentally different. So what would happen in numpy, where you don't control things that much?
>
> David

Well, of course my goal was not to say that my simple testcase can be copied/pasted into numpy :) Of course it is one of the best cases for openmp. Of course the pragmas can be more complex than that (you can tell it which variables can/cannot be shared, for instance).

The size: using openmp will be slower on small arrays, that is clear, but the user doing very large computations is smart enough to know when he needs to split his job into threads. The obvious solution is to provide the user with // (parallel) and non-// functions.

sse: sse can help a lot, but multithreading just scales where sse mono-thread based solutions don't.

Build/link: it is an issue. It has to be tested. I do not know, because I haven't even tried.

So, IMHO it would be nice to try to put some simple OpenMP pragmas into numpy *only to see what is going on*. Even if it only works with gcc, or even if... I do not know... It would be a first step. Step by step :) If the performances are so bad, ok, forget about it... but it would be sad, because the next generation of CPUs will not be more powerful, they will only have more than one or two cores on the same chip.

Xavier
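[For what it's worth, OpenMP itself has a mechanism for exactly this trade-off: the if clause on a parallel construct, which serializes the region at run time when the condition is false. A minimal sketch, assuming gcc with -fopenmp and -lm; the cutoff value is invented and would have to come from benchmarks on each machine:]

#include <math.h>

/* invented cutoff, in elements */
#define OMP_MIN_ELTS 10000

void sine_auto(const double *in, double *out, long n)
{
    long i;
    /* below the cutoff this runs single-threaded, avoiding the
     * thread-management overhead measured in the testcase above */
#pragma omp parallel for if (n > OMP_MIN_ELTS)
    for (i = 0; i < n; i++)
        out[i] = sin(in[i]);
}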
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Gnata Xavier wrote:
> Well, of course my goal was not to say that my simple testcase can be copied/pasted into numpy :) Of course it is one of the best cases for openmp. Of course the pragmas can be more complex than that (you can tell it which variables can/cannot be shared, for instance).
>
> The size: using openmp will be slower on small arrays, that is clear, but the user doing very large computations is smart enough to know when he needs to split his job into threads. The obvious solution is to provide the user with // (parallel) and non-// functions.

IMHO, that's a really bad solution. It should be dynamically enabled (like in matlab, if I remember correctly). And this means having a plug-in subsystem to load/unload different implementations... which is one of the things I was interested in getting done for numpy 1.1 (or above).

For small arrays: how much slower? Does it make the code slower than without OpenMP? For example, what does your code say when N is 10, 100, 1000?

> sse: sse can help a lot, but multithreading just scales where sse mono-thread based solutions don't.

It depends: it scales pretty well if you use several processes, and if you can design your application in a multi-process way.

> Build/link: it is an issue. It has to be tested. I do not know, because I haven't even tried. So, IMHO it would be nice to try to put some simple OpenMP pragmas into numpy *only to see what is going on*. Even if it only works with gcc, or even if... I do not know... It would be a first step. Step by step :)

I agree about the step by step approach; I am just not sure I agree with your steps, that's all. Personally, I would first try getting a plug-in system working with numpy. But really, prove me wrong. Do it, try putting some pragmas at some places in the ufunc machinery or somewhere else; as I said earlier, I would be happy to add support for open mp at the build level, at least in numscons. I would love being proven wrong and having a numpy which scales well with multi-core :)

cheers,

David
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
> If the performances are so bad, ok, forget about it... but it would be sad, because the next generation of CPUs will not be more powerful, they will only have more than one or two cores on the same chip.

I don't think this is the worst that will happen. The worst is what has been seen for decades: the CPU's raw power rising faster than memory speed (bandwidth and latency). With the next generation of Intel's CPUs, the memory controller will at last be on the CPU and correctly shared between cores, but for the moment, with our issues, splitting these kinds of parallel jobs (additions, subtractions, ...) will not enhance speed, as the bottleneck is the memory controller/system bus that is already used at 100% by one core.

Matthieu

--
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Hi David et al,

Very interesting. I thought that the 64-bit gcc's automatically aligned memory on 16-bit (or 32-bit) boundaries. But apparently not, because running your code certainly made the intrinsic code quite a bit faster. However, another thing that I noticed was that the simple code was _much_ faster using gcc-4.3 with -O3 than with -O2. I've noticed this with some other code recently as well -- the auto loop-unrolling really helps for this type of code. You can see my benchmarks here (posted there to avoid line wrap issues):

http://www.cv.nrao.edu/~sransom/vec_results.txt

Scott

On Sun, Mar 23, 2008 at 01:59:39PM +0900, David Cournapeau wrote:
> Charles R Harris wrote:
>> It looks like memory access is the bottleneck, otherwise running 4 floats through in parallel should go a lot faster. I need to modify the program a bit and see how it works for doubles.
>
> I am not sure the benchmark is really meaningful: it does not use aligned buffers (16-byte alignment), and because of that, does not give a good idea of what can be expected from SSE. It shows why it is not so easy to get good performances, and why just throwing a few optimized loops won't work, though. Using sse/sse2 from unaligned buffers is a waste of time. Without this alignment, you need to take it into account (using _mm_loadu_ps vs _mm_load_ps), and that's extremely slow, basically killing most of the speed increase you can expect from using sse. Here is what I get with the above benchmark:
>
> Problem size              Simple              Intrin              Inline
>          100     0.0002ms (100.0%)   0.0001ms ( 71.5%)   0.0001ms ( 85.0%)
>         1000     0.0014ms (100.0%)   0.0010ms ( 70.6%)   0.0013ms ( 96.8%)
>        10000     0.0162ms (100.0%)   0.0095ms ( 58.2%)   0.0128ms ( 78.7%)
>       100000     0.4189ms (100.0%)   0.4135ms ( 98.7%)   0.4149ms ( 99.0%)
>      1000000     5.9523ms (100.0%)   5.8933ms ( 99.0%)   5.8910ms ( 99.0%)
>     10000000    58.9645ms (100.0%)  58.2620ms ( 98.8%)  58.7443ms ( 99.6%)
>
> Basically, no help at all: this is on a P4, whose fpu is extremely slow if not used with optimized sse. Now, if I use posix_memalign, replace the intrinsics for aligned access, and use an accurate cycle counter (cycle.h, provided by fftw). Compiled as is:
>
> Testing methods... All OK
>
> Problem size               Simple                    Intrin                    Inline
>          100    4.16e+02 cycles (100.0%)  4.04e+02 cycles ( 97.1%)  4.92e+02 cycles (118.3%)
>         1000    3.66e+03 cycles (100.0%)  3.11e+03 cycles ( 84.8%)  4.10e+03 cycles (111.9%)
>        10000    3.47e+04 cycles (100.0%)  3.01e+04 cycles ( 86.7%)  4.06e+04 cycles (116.8%)
>       100000    1.36e+06 cycles (100.0%)  1.34e+06 cycles ( 98.7%)  1.45e+06 cycles (106.7%)
>      1000000    1.92e+07 cycles (100.0%)  1.87e+07 cycles ( 97.1%)  1.89e+07 cycles ( 98.2%)
>     10000000    1.86e+08 cycles (100.0%)  1.80e+08 cycles ( 96.8%)  1.81e+08 cycles ( 97.4%)
>
> Compiled with -DALIGNED, which uses aligned access intrinsics:
>
> Testing methods... All OK
>
> Problem size               Simple                    Intrin                    Inline
>          100    4.16e+02 cycles (100.0%)  1.96e+02 cycles ( 47.1%)  4.92e+02 cycles (118.3%)
>         1000    3.82e+03 cycles (100.0%)  1.56e+03 cycles ( 40.8%)  4.22e+03 cycles (110.4%)
>        10000    3.46e+04 cycles (100.0%)  1.92e+04 cycles ( 55.5%)  4.13e+04 cycles (119.4%)
>       100000    1.32e+06 cycles (100.0%)  1.12e+06 cycles ( 85.0%)  1.16e+06 cycles ( 87.8%)
>      1000000    1.95e+07 cycles (100.0%)  1.92e+07 cycles ( 98.3%)  1.95e+07 cycles (100.2%)
>     10000000    1.82e+08 cycles (100.0%)  1.79e+08 cycles ( 98.4%)  1.81e+08 cycles ( 99.3%)
>
> This gives a drastic difference (I did not touch the inline code, because it is sunday and I am lazy). If I use this on a sane CPU (core 2 duo, macbook) instead of my pentium4, I get better results (in particular, sse code is never slower, and I get a double speed increase as long as the buffer can fit in cache).
> It looks like using prefetch also gives some improvements when on the edge of the cache size (my P4 has a 512 kb L2 cache):
>
> Testing methods... All OK
>
> Problem size               Simple                    Intrin                    Inline
>          100    4.16e+02 cycles (100.0%)  2.52e+02 cycles ( 60.6%)  4.92e+02 cycles (118.3%)
>         1000    3.55e+03 cycles (100.0%)  1.85e+03 cycles ( 52.2%)  4.21e+03 cycles (118.7%)
>        10000    3.48e+04 cycles (100.0%)  1.76e+04 cycles ( 50.6%)  4.13e+04 cycles (118.9%)
>       100000    1.11e+06 cycles (100.0%)
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Scott Ransom wrote:
> Hi David et al,
> Very interesting. I thought that the 64-bit gcc's automatically aligned memory on 16-bit (or 32-bit) boundaries.

Note that I am talking about bytes, not bits. Default alignment depends on many parameters, like the OS and the C runtime. For example, on Mac OS X, malloc defaults to 16-byte alignment (I guess this comes from the ppc ages, where the only way to keep up with x86 was to aggressively use altivec). With glibc, it is 8-byte aligned; for big sizes (where "big" is linked to the mmap threshold), it is almost never 16-byte aligned (there was a discussion on this on the numpy ML, initiated by Steve G. Johnson, one of the main FFTW developers). I don't know about the dependency on 64-bit archs.

IMHO, the only real solution for this point is to have some support for aligned buffers in numpy, with aligned memory allocators.

cheers,

David
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sun, Mar 23, 2008 at 6:41 AM, Francesc Altet [EMAIL PROTECTED] wrote:
> On Sunday 23 March 2008, Charles R Harris wrote:
>> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
>> cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
>>
>> Problem size              Simple              Intrin              Inline
>>          100     0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
>>         1000     0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
>>        10000     0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
>>       100000     0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
>>      1000000     2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
>>     10000000    28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)
>
> I'm mystified about your machine requiring just 28 ms for completing the 10 million test, when most of the other, similar processors in this thread (some faster than yours) fall pretty far from your figure. What sort of memory subsystem are you using?

Yeah, I noticed that ;) The cpu is an E6600, which was the low end of the performance Core 2 Duo processors before the recent Intel releases, the north bridge (memory controller) is a P35, and the memory is DDR2 running at 800 MHz with 4-4-4-12 timing. The only things I tweaked were the memory voltage and timings. Raising the memory speed from 667 to 800 made a noticeable difference in my perception of speed, which is remarkable in itself. The motherboard was cheap, it goes for $70 these days. I've seen folks overclocking the E6600 up to 3.8 GHz, and over 3 GHz is common. Sometimes it's almost tempting...

Chuck
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
OK, I'm really impressed with the improvements in vectorization for gcc 4.3. It really seems like it's able to work with real loops, which wasn't the case with 4.1. I think Chuck's right that we should simply special-case contiguous data and allow the auto-vectorizer to do the rest. Something like this for the ufuncs:

/**begin repeat
   #TYPE=(BOOL, BYTE,UBYTE,SHORT,USHORT,INT,UINT,LONG,ULONG,LONGLONG,ULONGLONG,FLOAT,DOUBLE,LONGDOUBLE)*2#
   #OP=||, +*13, ^, -*13#
   #kind=add*14, subtract*14#
   #typ=(Bool, byte, ubyte, short, ushort, int, uint, long, ulong, longlong, ulonglong, float, double, longdouble)*2#
*/
static void
@TYPE@_@kind@_vector(@typ@ *i1, @typ@ *i2, @typ@ *op, intp n)
{
    intp i;
    for (i = 0; i < n; i++) {
        op[i] = i1[i] @OP@ i2[i];
    }
}

static void
@TYPE@_@kind@(char **args, intp *dimensions, intp *steps, void *func)
{
    register intp i;
    intp is1 = steps[0], is2 = steps[1], os = steps[2], n = dimensions[0];
    char *i1 = args[0], *i2 = args[1], *op = args[2];
    /* steps are in bytes, so contiguous data means element-sized strides */
    if (is1 == sizeof(@typ@) && is2 == sizeof(@typ@) && os == sizeof(@typ@)) {
        @TYPE@_@kind@_vector((@typ@ *)i1, (@typ@ *)i2, (@typ@ *)op, n);
        return;
    }
    for (i = 0; i < n; i++, i1 += is1, i2 += is2, op += os) {
        *((@typ@ *)op) = *((@typ@ *)i1) @OP@ *((@typ@ *)i2);
    }
}
/**end repeat**/

We also need to add -ftree-vectorize to the standard compile flags, at least for the ufuncs.

James
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On 23/03/2008, David Cournapeau [EMAIL PROTECTED] wrote:
> Gnata Xavier wrote:
>> Hi, I have a very limited knowledge of openmp but please consider this testcase:
>
> Honestly, if it were that simple, it would already have been done a long time ago. The problem is that your test case is not even remotely close to how things have to be done in numpy.

Actually, there are a few places where a parallel for would serve to accelerate all ufuncs. There are build issues, yes, though they are mild; we would also want to provide some API to turn parallelization on and off, and we'd have to verify that OpenMP did not slow down small arrays, but that would be it. (And I suspect that OpenMP is smart enough to use single threads without locking when multiple threads won't help. Certainly all the information is available to OpenMP to make such decisions.) This is why I suggested making this change: it should be a low-cost, high-gain change.

Anne
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
> (And I suspect that OpenMP is smart enough to use single threads without locking when multiple threads won't help. Certainly all the information is available to OpenMP to make such decisions.)

Unfortunately, I don't think there is such a thing. For instance, the number of threads used by the MKL is set by an environment variable.

Matthieu

--
French PhD student
Website : http://matthieu-brucher.developpez.com/
Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn : http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Anne Archibald wrote:
> Actually, there are a few places where a parallel for would serve to accelerate all ufuncs. There are build issues, yes, though they are mild;

Maybe, maybe not. Anyway, I said that I would step in to resolve those issues if someone else does the coding.

> we would also want to provide some API to turn parallelization on and off, and we'd have to verify that OpenMP did not slow down small arrays, but that would be it. (And I suspect that OpenMP is smart enough to use single threads without locking when multiple threads won't help. Certainly all the information is available to OpenMP to make such decisions.)

How so? Maybe you're right, but that's not so obvious to me.

But since several people seem to know OpenMP and are eager to add it to numpy, it should be easy to get started: all they have to do is get the numpy sources from svn and start coding :) With numscons at least, they can manually add the -fopenmp and -lgomp flags from the command line to quickly do a prototype (it should not be much more difficult with distutils).

cheers,

David
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Gnata Xavier wrote:
> Ok I will try to see what I can do, but it is sure that we do need the plug-in system first (read before the threads in the numpy release). During the devel of 1.1, I will try to find some time to understand where I should put some pragmas into the ufuncs using a very conservative approach. Any people with some OpenMP knowledge are welcome, because I'm not an OpenMP expert but only an OpenMP user in my C/C++ codes.

Note that the plug-in idea is just my own idea, it is not something agreed by anyone else. So maybe it won't be done for numpy 1.1, or at all. It depends on the main maintainers of numpy.

> and the results :
>
>   niter   size   no threads     threads
>    1000     80    10.308471   30.007250
>     100    160     1.902563    5.800172
>      10    320     0.543008    1.123274
>       1    640     0.206823    0.223031
>    1000   1280     0.088898    0.044268
>     100   2560     0.150429    0.008880
>      10   5120     0.289589    0.002084
>
> ---
>
> On this machine, we should start to use threads *in this testcase* iff size >= 1280 (a 100*100 image is a very very small one :))

Maybe openMP can be more clever, but it tends to show that openMP, when used naively, can *not* decide how many threads to use. That's really the core problem: again, I don't know much about openMP, but almost any project using multi-thread/multi-process and not being embarrassingly parallel has the problem that it makes things much slower for many cases where thread creation/management and co have a lot of overhead proportionally to the computation. The problem is to determine the N, dynamically, or in a way which works well for most cases. OpenMP was created for HPC, where you have very large data; it is not so obvious to me that it is adapted to numpy, which has to be much more flexible. Being fast on a given problem is easy; being fast on a whole range, that's another story: the problem really is to be as fast as before on small arrays.

The fact that matlab, while having much more resources than us, took years to do it, makes me extremely skeptical on the efficient use of multi-threading without real benchmarks for numpy. They have a dedicated team, who developed a JIT for matlab, which inserts multi-threaded code on the fly (for m files, not when you are in the interpreter), and who uses multi-threaded blas/lapack (which is already available in numpy, depending on the blas/lapack you are using). But again, and that's really the only thing I have to say: prove me wrong :)

David
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Matthieu Brucher wrote:
> Hi, It seems complicated to add OpenMP in the code; I don't think many people have the knowledge to do this, not mentioning the fact that there are a lot of Python calls in the different functions.

Yes, this makes potential optimizations harder, at least for someone like me who does not know the numpy internals well. That's still something I have not thought a lot about, but that's an example of why I like the idea of splitting the numpy C code into core C / wrappers: you would only use OpenMP in the core C library, and everything would be transparent at higher levels (if I understand correctly how OpenMP works, which may very well not be true :) ). OpenMP, sse, etc.: those are different views of the same underlying problem, in this context. But I do not know enough about numpy internals yet (in particular, how the number protocol works, and the relationship with the ufunc machinery) to know if it is feasible in a reasonable number of hours, or even if it is feasible at all :)

> The multicore support Matlab has seems far more related to the underlying libraries than to something they did.

Yes, that's why I put the matlab link: actually, most of the parallel things it does are related to the mkl and co. That's something which is much easier to handle, and possible right now, if I understand right?

cheers,

David
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Personally, I think that the time would be better spent optimizing routines for single-threaded code and relying on BLAS and LAPACK libraries to use multiple cores for more complex calculations. In particular, doing some basic loop unrolling and SSE versions of the ufuncs would be beneficial. I have some experience writing SSE code using intrinsics and would be happy to give it a shot if people tell me what functions I should focus on.

James
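[For concreteness, hand-written SSE for an elementwise float multiply might look like the following minimal sketch using intrinsics only, not assembly. It assumes contiguous, 16-byte-aligned buffers and is not numpy code; compile with gcc -msse:]

#include <xmmintrin.h>

void mul_float_sse(const float *a, const float *b, float *out, long n)
{
    long i = 0;
    /* four floats per iteration, using the fast aligned loads/stores */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_mul_ps(va, vb));
    }
    /* scalar tail for the remaining 0-3 elements */
    for (; i < n; i++)
        out[i] = a[i] * b[i];
}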
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
James Philbin wrote:
> Personally, I think that the time would be better spent optimizing routines for single-threaded code and relying on BLAS and LAPACK libraries to use multiple cores for more complex calculations. In particular, doing some basic loop unrolling and SSE versions of the ufuncs would be beneficial. I have some experience writing SSE code using intrinsics and would be happy to give it a shot if people tell me what functions I should focus on.
>
> James

gcc keeps advancing autovectorization. Is manual vectorization worth the trouble?
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
gcc keeps advancing autovectorization. Is manual vectorization worth the trouble? Well, the way the ufuncs are written at the moment, -ftree-vectorize will never kick in, due to the non-constant strides. To get this to work, one has to special-case out unit strides. Even with constant strides, -ftree-vectorize often produces sub-optimal code, as it has to make very conservative assumptions about the contents of variables (it can do better if -fstrict-aliasing is used, but I think numpy is not compiled with this flag). So, in other words: yes, I think it is worth hand-vectorizing (and unrolling) the most common cases. James ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
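A rough illustration of the special-casing described here (the function names are made up, not numpy's): the generic char-pointer loop with runtime strides gives the compiler nothing to work with, while a unit-stride branch over restrict-qualified typed pointers is exactly the shape of loop -ftree-vectorize can handle.

#include <stddef.h>

/* Generic strided form: the compiler cannot vectorize this, since the
   strides are runtime values and the pointers are untyped. */
static void mul_strided(char *a, char *b, char *c, ptrdiff_t n,
                        ptrdiff_t sa, ptrdiff_t sb, ptrdiff_t sc)
{
    ptrdiff_t i;
    for (i = 0; i < n; i++, a += sa, b += sb, c += sc)
        *(double *)c = *(double *)a * *(double *)b;
}

/* Contiguous (unit-stride) special case: a constant stride and restrict
   pointers give the autovectorizer something it can work with. */
static void mul_contig(const double *restrict a, const double *restrict b,
                       double *restrict c, ptrdiff_t n)
{
    ptrdiff_t i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}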
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 11:43 AM, Neal Becker [EMAIL PROTECTED] wrote: gcc keeps advancing autovectorization. Is manual vectorization worth the trouble? The inner loop of a unary ufunc looks like

/*UFUNC_API*/
static void
PyUFunc_d_d(char **args, intp *dimensions, intp *steps, void *func)
{
    intp i;
    char *ip1 = args[0], *op = args[1];
    for (i = 0; i < *dimensions; i++, ip1 += steps[0], op += steps[1]) {
        *(double *)op = ((DoubleUnaryFunc *)func)(*(double *)ip1);
    }
}

While it might help the compiler to put the steps on the stack as constants, it is hard to see how the compiler could vectorize the loop given the information available and the fact that the input data might not be aligned or contiguous. I suppose one could make a small local buffer, copy the data into it, and then use SSE, and that might actually help for some things. But it is also likely that the function itself won't deal gracefully with vectorized data. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
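A sketch of the small-local-buffer idea floated above, hedged rather than proposed: gather strided elements into a contiguous scratch array, apply a fast contiguous kernel, and scatter back. Whether the extra copies pay off is exactly the open question. The helper name and CHUNK value are illustrative.

#include <stddef.h>

#define CHUNK 1024  /* scratch size; small enough to stay in L1 cache */

/* Apply a contiguous kernel to strided data by copying through a local
   buffer, CHUNK elements at a time. */
static void apply_buffered(void (*kernel)(double *, ptrdiff_t),
                           char *ip, char *op, ptrdiff_t n,
                           ptrdiff_t istep, ptrdiff_t ostep)
{
    double buf[CHUNK];
    ptrdiff_t i, j, m;
    for (i = 0; i < n; i += m) {
        m = (n - i < CHUNK) ? n - i : CHUNK;
        for (j = 0; j < m; j++)                  /* gather */
            buf[j] = *(double *)(ip + (i + j) * istep);
        kernel(buf, m);                          /* contiguous, vectorizable */
        for (j = 0; j < m; j++)                  /* scatter */
            *(double *)(op + (i + j) * ostep) = buf[j];
    }
}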
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 12:01 PM, James Philbin [EMAIL PROTECTED] wrote: [...] Even with constant strides, -ftree-vectorize often produces sub-optimal code, as it has to make very conservative assumptions about the content of variables (it can do better if -fstrict-aliasing is used, but I think

Numpy would die with strict-aliasing, because it relies on casting char pointers to various types. It might be possible to use void pointers, but that would be a major do-over, and it isn't clear what the strides and offsets, currently in char units, would become. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
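For what it's worth, one aliasing-safe alternative to raw pointer casts, offered here as a sketch rather than anything proposed in the thread, is to move the bytes with memcpy, which compilers optimize down to a plain load:

#include <string.h>

/* Read a double from an arbitrarily aligned byte pointer without
   violating strict-aliasing rules; gcc turns the memcpy into a load. */
static double load_double(const char *p)
{
    double d;
    memcpy(&d, p, sizeof d);
    return d;
}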
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
James Philbin wrote: Personally, I think that the time would be better spent optimizing routines for single-threaded code and relying on BLAS and LAPACK libraries to use multiple cores for more complex calculations. In particular, doing some basic loop unrolling and SSE versions of the ufuncs would be beneficial. I have some experience writing SSE code using intrinsics and would be happy to give it a shot if people tell me what functions I should focus on. Fabulous! This is on my Project List of todo items for NumPy. See http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend some time refactoring the ufunc loops so that the templating does not get in the way of doing this on a case-by-case basis. 1) You should focus on the math operations: add, subtract, multiply, divide, and so forth. 2) Then, for combined operations, we should expose the functionality at a high level, so that somebody could write code to take advantage of it. It would be easiest to use intrinsics, which would then work for AMD and Intel on multiple compilers. -Travis O. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Charles R Harris wrote: [...] The inner loop of a unary ufunc looks like [see the code above]. While it might help the compiler to put the steps on the stack as constants, it is hard to see how the compiler could vectorize the loop given the information available and the fact that the input data might not be aligned or contiguous. [...] I think the thing to do is to special-case the code so that if the strides work for vectorization, then a different bit of code is executed, and this current code is used as the final special case. Something like this would be relatively straightforward, if a bit tedious, to do. -Travis ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
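The special-case dispatch Travis describes might look roughly like this (a sketch with illustrative names and types, not numpy's actual loop): check the steps at runtime and branch to a typed, contiguous kernel, keeping the current strided code as the fallback.

#include <stddef.h>

typedef double (*DoubleUnaryFunc)(double);

/* If both steps equal sizeof(double), run a typed loop the compiler can
   vectorize; otherwise fall back to the existing strided loop. */
static void unary_d_d(char **args, ptrdiff_t n, ptrdiff_t *steps, void *func)
{
    DoubleUnaryFunc f = (DoubleUnaryFunc)func;
    ptrdiff_t i;
    if (steps[0] == sizeof(double) && steps[1] == sizeof(double)) {
        const double *in = (const double *)args[0];
        double *out = (double *)args[1];
        for (i = 0; i < n; i++)
            out[i] = f(in[i]);
    }
    else {
        char *ip = args[0], *op = args[1];
        for (i = 0; i < n; i++, ip += steps[0], op += steps[1])
            *(double *)op = f(*(double *)ip);
    }
}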
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
OK, so a few questions: 1. I'm not familiar with the format of the code generators. Should I pull the special case out of the /**begin repeat blocks, or should I do a conditional inside the repeats (how does one do this)? 2. I don't have access to Windows+VisualC, so I will need some help testing on Windows. 3. Should patches be posted to the mailing list or checked into svn? James ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On 22.03.2008, at 19:20, Travis E. Oliphant wrote: I think the thing to do is to special-case the code so that if the strides work for vectorization, then a different bit of code is executed and this current code is used as the final special case. Something like this would be relatively straightforward, if a bit tedious, to do. I've experimented with branching the ufuncs into different constant-stride and aligned/unaligned cases to be able to use SSE via compiler intrinsics. I expected a considerable gain, as I was using float32 with stride 1 most of the time. However, profiling revealed that hardly anything was gained, because of 1) non-alignment of the vectors (this _could_ be handled by shuffled loading of the values, though), and 2) the fact that my application used relatively large vectors that wouldn't fit into the CPU cache, so the memory transfer slowed down the CPU. I found the latter to be a real showstopper for most of my experiments with SIMD. It's especially a problem for numpy, because smaller vectors have a lot of Python/numpy overhead, and larger ones don't really benefit due to cache exhaustion. I'm curious whether OpenMP gives better results, as multi-cores often share their caches. greetings, Thomas ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
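A sketch of the aligned/unaligned branching Thomas mentions (illustrative, not his code): test the low bits of the addresses and pick aligned or unaligned SSE loads accordingly.

#include <stdint.h>
#include <xmmintrin.h>

/* True if p sits on a 16-byte boundary, as _mm_load_ps requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}

/* Multiply one group of four floats, choosing the load path by alignment. */
static void mul4(const float *a, const float *b, float *c)
{
    __m128 va, vb;
    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(c)) {
        va = _mm_load_ps(a);            /* aligned loads */
        vb = _mm_load_ps(b);
        _mm_store_ps(c, _mm_mul_ps(va, vb));
    }
    else {
        va = _mm_loadu_ps(a);           /* unaligned fallback */
        vb = _mm_loadu_ps(b);
        _mm_storeu_ps(c, _mm_mul_ps(va, vb));
    }
}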
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On 22/03/2008, Thomas Grill [EMAIL PROTECTED] wrote: I've experimented with branching the ufuncs into different constant strides and aligned/unaligned cases to be able to use SSE using compiler intrinsics. I expected a considerable gain as i was using float32 with stride 1 most of the time. However, profiling revealed that hardly anything was gained because of 1) non-alignment of the vectors this _could_ be handled by shuffled loading of the values though 2) the fact that my application used relatively large vectors that wouldn't fit into the CPU cache, hence the memory transfer slowed down the CPU. I found the latter to be a real showstopper for most of my experiments with SIMD. It's especially a problem for numpy because smaller vectors have a lot of Python/numpy overhead, and larger ones don't really benefit due to cache exhaustion. This particular issue can sometimes be reduced by clever use of the prefetching intrinsics. I'm not totally sure it's going to help inside most ufuncs, though, since the runtime is so dominated by memory reads. In a program I was writing I had time to do a 128-point real FFT in the time it took to load the next 64 floats... Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
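For reference, _mm_prefetch is the kind of intrinsic Anne means; a toy sketch, where the prefetch distance of 64 floats ahead is a guess that would need tuning per machine:

#include <xmmintrin.h>

static void mul_prefetch(const float *a, const float *b, float *c, long n)
{
    long i;
    for (i = 0; i + 4 <= n; i += 4) {
        /* Hint upcoming cache lines into L1 while we work on this group. */
        _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
        _mm_prefetch((const char *)(b + i + 64), _MM_HINT_T0);
        _mm_storeu_ps(c + i, _mm_mul_ps(_mm_loadu_ps(a + i),
                                        _mm_loadu_ps(b + i)));
    }
    for (; i < n; i++)    /* scalar tail */
        c[i] = a[i] * b[i];
}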
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On 22/03/2008, Travis E. Oliphant [EMAIL PROTECTED] wrote: [...] I should spend some time refactoring the ufunc loops so that the templating does not get in the way of doing this on a case by case basis. [...] I think even heavier use of code generation would be a good idea here. There are so many different versions of each loop, and the fastest way to run each one is going to be different for different versions and different platforms, that a routine that assembled the code from chunks and picked the fastest combination for each instance might make a big difference; this is roughly what FFTW and ATLAS do. There are also some optimizations to be made at a higher level that might give these optimizations more traction. For example:

A = randn(100*100)
A.shape = (100, 100)
A*A

There's no reason the multiply ufunc couldn't flatten A and use a single unstrided loop to do the multiplication. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
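The flattening Anne suggests amounts to a check like this, sketched with made-up names rather than numpy's real ones: if the data is one contiguous block, the shape is irrelevant and a single flat loop will do.

#include <stddef.h>

/* If the nd-array data is one contiguous block, a single flat loop
   suffices whatever the shape is.  Illustrative only. */
static void square_contig(const double *in, double *out,
                          const ptrdiff_t *shape, int ndim)
{
    ptrdiff_t n = 1, i;
    int d;
    for (d = 0; d < ndim; d++)
        n *= shape[d];              /* total element count */
    for (i = 0; i < n; i++)         /* one unstrided loop */
        out[i] = in[i] * in[i];
}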
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
[...] the fact that my application used relatively large vectors that wouldn't fit into the CPU cache, hence the memory transfer slowed down the CPU. I've had generally positive results from vectorizing code in the past, admittedly on architectures with fast memory buses (Xeon 5100s). Naive implementations of most simple vector operations (dot, +, -, etc.) were sped up by around 20%. I also haven't found aligned accesses to make much difference (~2-3%), but this might be dependent on the architecture. James ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 12:54 PM, Anne Archibald [EMAIL PROTECTED] wrote: [...] I think even heavier use of code generation would be a good idea here. [...] a routine that assembled the code from chunks and picked the fastest combination for each instance might make a big difference - this is roughly what FFTW and ATLAS do. Maybe it's time to revisit the template subsystem I pulled out of Django. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Anne Archibald wrote: [...] There are also some optimizations to be made at a higher level that might give these optimizations more traction. For example: A = randn(100*100); A.shape = (100, 100); A*A. There's no reason the multiply ufunc couldn't flatten A and use a single unstrided loop to do the multiplication. Good idea, it does already do that :-) The ufunc machinery is also a good place for an optional thread pool. Perhaps we could drum up interest in a Need for Speed Sprint on NumPy sometime over the next few months. -Travis O. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris [EMAIL PROTECTED] wrote: Maybe it's time to revisit the template subsystem I pulled out of Django. I am still -lots on using the Django template system. Please, please, please, look at Jinja or another templating package that could be dropped in without *any* modification. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED] wrote: On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris [EMAIL PROTECTED] wrote: Maybe it's time to revisit the template subsystem I pulled out of Django. I am still -lots on using the Django template system. Please, please, please, look at Jinja or another templating package that could be dropped in without *any* modification. Well, I have a script that pulls the relevant parts out of Django. I know you had a bad experience, but... That said, Jinja looks interesting. It uses the Django syntax, which was one of the things I liked most about Django templates. In fact, it looks pretty much like Django templates made into a standalone application, which is what I was after. However, it's big: the installed egg is about 1 MiB, which is roughly 12x the size of my cut-down version of Django, and it has some C code, so it would need building. On the other hand, it also looks like it contains a lot of extraneous stuff, like translations, that could be removed. Would you be averse to adding it if it looks useful? Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 8:16 PM, Travis E. Oliphant [EMAIL PROTECTED] wrote: Perhaps we could drum up interest in a Need for Speed Sprint on NumPy sometime over the next few months. I guess we'd all like our computations to complete more quickly, as long as they still give valid results. I suggest we make sure that we have very decent test coverage of the C code before doing any further optimization. The regression tests cover a number of important corner cases, but we don't have all that many tests covering everyday usage. The ideal place for these would be inside the docstrings, and if we get the wiki <-> docstring roundtripping working properly, anyone would be able to contribute towards that goal. I know that it is possible to track coverage using gcov (you have to compile numpy into the python binary), but if anyone has a better way, I'd like to hear about it. Regards Stéfan ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
OK, I've written a simple benchmark which implements an elementwise multiply (A=B*C) in three different ways (standard C, intrinsics, hand-coded assembly). On the face of things, the results seem to indicate that the vectorization works best on medium-sized inputs. If people could post the results of running the benchmark on their machines (takes ~1 min), along with the output of gcc --version and their chip model, that would be very useful. It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench Here's two:

CPU: Core Duo T2500 @ 2GHz
gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)

  Problem size     Simple               Intrin               Inline
           100     0.0003ms (100.0%)    0.0002ms ( 67.7%)    0.0002ms ( 50.6%)
          1000     0.0030ms (100.0%)    0.0021ms ( 69.2%)    0.0015ms ( 50.6%)
         10000     0.0370ms (100.0%)    0.0267ms ( 72.0%)    0.0279ms ( 75.4%)
        100000     0.2258ms (100.0%)    0.1469ms ( 65.0%)    0.1273ms ( 56.4%)
       1000000     4.5690ms (100.0%)    4.4616ms ( 97.6%)    4.4185ms ( 96.7%)
      10000000    47.0022ms (100.0%)   45.4100ms ( 96.6%)   44.4437ms ( 94.6%)

CPU: Intel Xeon E5345 @ 2.33GHz
gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)

  Problem size     Simple               Intrin               Inline
           100     0.0001ms (100.0%)    0.0001ms ( 69.2%)    0.0001ms ( 77.4%)
          1000     0.0010ms (100.0%)    0.0008ms ( 78.1%)    0.0009ms ( 86.6%)
         10000     0.0108ms (100.0%)    0.0088ms ( 81.2%)    0.0086ms ( 79.6%)
        100000     0.1131ms (100.0%)    0.0897ms ( 79.3%)    0.0872ms ( 77.1%)
       1000000     5.2103ms (100.0%)    3.9153ms ( 75.1%)    3.8328ms ( 73.6%)
      10000000    54.1815ms (100.0%)   51.8286ms ( 95.7%)   51.4366ms ( 94.9%)

James

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <xmmintrin.h>

int sizes[6] = {100, 1000, 10000, 100000, 1000000, 10000000};
int nsizes = 6;
int repeats = 300;

double accurate_time()
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + t.tv_usec*0.000001;
}

void rand_vector(int sz, float* vec)
{
    int i;
    for (i = 0; i < sz; i++) {
        vec[i] = (float)rand()/RAND_MAX;
    }
}

typedef void (*vec_func)(float*, float*, float*, int);

double time_func(vec_func func, int sz)
{
    int i;
    double t1, t2;
    float *v1, *v2, *v3;
    v1 = (float*)malloc(sizeof(float)*sz);
    v2 = (float*)malloc(sizeof(float)*sz);
    v3 = (float*)malloc(sizeof(float)*sz);
    rand_vector(sz, v1);
    rand_vector(sz, v2);
    t1 = accurate_time();
    for (i = 0; i < repeats; i++) {
        func(v1, v2, v3, sz);
    }
    t2 = accurate_time();
    free(v1);
    free(v2);
    free(v3);
    return (t2 - t1)/repeats;
}

void multiply_float_simple(float* a, float* b, float* c, int sz)
{
    int i;
    for (i = 0; i < sz; i++) {
        c[i] = a[i]*b[i];
    }
}

void multiply_float_intrin(float* a, float* b, float* c, int sz)
{
    __m128 a_, b_, c_;
    int i;
    for (i = 0; i < (sz - 4); i += 4) {
        a_ = _mm_loadu_ps(a + i);
        b_ = _mm_loadu_ps(b + i);
        c_ = _mm_mul_ps(a_, b_);
        _mm_storeu_ps(c + i, c_);
    }
    for ( ; i < sz; i++) {
        a_ = _mm_load_ss(a + i);
        b_ = _mm_load_ss(b + i);
        c_ = _mm_mul_ss(a_, b_);
        _mm_store_ss(c + i, c_);
    }
}

void multiply_float_inline(float* a, float* b, float* c, int sz)
{
#ifndef __SSE__
#error Use -msse
#else
    __asm__ __volatile__ (
        "mov %[sz], %%eax\n\t"
        "test $-4, %%eax\n\t"
        "jz .L_11_%=\n\t"
        ".L_10_%=:\n\t"
        "movlps 0(%[a]), %%xmm0\n\t"    /* xmm0 = a */
        "movhps 8(%[a]), %%xmm0\n\t"
        "movlps 0(%[b]), %%xmm1\n\t"    /* xmm1 = b */
        "movhps 8(%[b]), %%xmm1\n\t"
        "mulps %%xmm0, %%xmm1\n\t"      /* xmm1 = a*b */
        "movlps %%xmm1, 0(%[c])\n\t"
        "movhps %%xmm1, 8(%[c])\n\t"
        "add $16, %[a]\n\t"
        "add $16, %[b]\n\t"
        "add $16, %[c]\n\t"
        "sub $4, %%eax\n\t"
        "test $-4, %%eax\n\t"
        "jnz .L_10_%=\n\t"
        ".L_11_%=:\n\t"
        "test $-1, %%eax\n\t"
        "jz .L_13_%=\n\t"
        ".L_12_%=:\n\t"
        "movss 0(%[a]), %%xmm0\n\t"
        "movss 0(%[b]), %%xmm1\n\t"
        "mulss %%xmm0, %%xmm1\n\t"
        "movss %%xmm1, 0(%[c])\n\t"
        "add $4, %[a]\n\t"
        "add $4, %[b]\n\t"
        "add $4, %[c]\n\t"
        "sub $1, %%eax\n\t"
        "test $-1, %%eax\n\t"
        "jnz .L_12_%=\n\t"
        ".L_13_%=:\n\t"
        : /* Outputs */ [a] "=r" (a), [b] "=r" (b), [c] "=r" (c)
        : /* Inputs  */ "0" (a), "1" (b), "2" (c), [sz] "m" (sz)
        : "%eax", "%xmm0", "%xmm1");
#endif
}

vec_func methods[3] = {multiply_float_simple, multiply_float_intrin, multiply_float_inline};
char *method_names[3] = {"Simple", "Intrin", "Inline"};
int nmethods = 3;

void test_methods(void)
{
    int test_size = 10003; /* Not divisible by 4 */
    int i, j;
    float *v1, *v2, *v3, *v4;
    v1 = (float*)malloc(test_size*sizeof(float));
    v2 = (float*)malloc(test_size*sizeof(float));
    v3 = (float*)malloc(test_size*sizeof(float));
    v4 = (float*)malloc(test_size*sizeof(float));
    rand_vector(test_size, v1);
    rand_vector(test_size, v2);
    methods[0](v1, v2, v3, test_size);
    for (i = 1; i < nmethods; i++) {
        methods[i](v1, v2, v4, test_size);
        for (j = 0; j < test_size; j++) {
            assert(v4[j] == v3[j]);
        }
    }
    free(v1);
    free(v2);
    free(v3);
    free(v4);
}

int main(void)
{
    int i, j;
    double dt, dt_simp;
    printf("Testing methods...\n");
    test_methods();
    printf("All OK\n\n");
    printf("%20s", "Problem size");
    for (i = 0; i < nmethods; i++) {
        printf("%20s", method_names[i]);
    }
    printf("\n");
    for (i = 0; i < nsizes; i++) {
        printf("%20d", sizes[i]);
        dt_simp = time_func(methods[0], sizes[i])*1000.0;
        printf("  %7.4fms (%5.1f%%)", dt_simp, 100.0);
        for (j = 1; j < nmethods; j++) {
            dt = time_func(methods[j], sizes[i])*1000.0;
            printf("  %7.4fms (%5.1f%%)", dt, (100.0*dt)/dt_simp);
        }
        printf("\n");
    }
    return 0;
}

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)

/proc/cpuinfo (both cores identical):
model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz
cpu MHz : 2201.000
cache size : 4096 KB
cpu cores : 2
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida

$ gcc -O2 vec_bench.c -o vec_bench
$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0003ms (100.0%)    0.0003ms ( 78.3%)    0.0003ms ( 75.5%)
          1000     0.0029ms (100.0%)    0.0022ms ( 75.9%)    0.0026ms ( 87.0%)
         10000     0.0131ms (100.0%)    0.0085ms ( 65.0%)    0.0092ms ( 70.3%)
        100000     0.1210ms (100.0%)    0.0875ms ( 72.3%)    0.0932ms ( 77.0%)
       1000000     4.2518ms (100.0%)    7.5801ms (178.3%)    7.6278ms (179.4%)
      10000000    81.6962ms (100.0%)   79.8668ms ( 97.8%)   81.6365ms ( 99.9%)

$ gcc -O3 -ffast-math vec_bench.c -o vec_bench
$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0003ms (100.0%)    0.0002ms ( 68.4%)    0.0003ms ( 74.2%)
          1000     0.0029ms (100.0%)    0.0023ms ( 77.2%)    0.0025ms ( 86.9%)
         10000     0.0353ms (100.0%)    0.0086ms ( 24.5%)    0.0092ms ( 26.1%)
        100000     0.1497ms (100.0%)    0.1013ms ( 67.6%)    0.1146ms ( 76.6%)
       1000000     4.4004ms (100.0%)    7.5651ms (171.9%)    7.6200ms (173.2%)
      10000000    81.3631ms (100.0%)   83.3591ms (102.5%)   79.8199ms ( 98.1%)

$ gcc -O3 -msse4a vec_bench.c -o vec_bench
$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0001ms (100.0%)    0.0001ms ( 67.5%)    0.0001ms ( 74.8%)
          1000     0.0011ms (100.0%)    0.0008ms ( 78.0%)    0.0009ms ( 86.4%)
         10000     0.0116ms (100.0%)    0.0085ms ( 73.2%)    0.0092ms ( 79.1%)
        100000     0.1500ms (100.0%)    0.0873ms ( 58.2%)    0.0931ms ( 62.1%)
       1000000     4.2654ms (100.0%)    7.5623ms (177.3%)    7.5713ms (177.5%)
      10000000    79.4805ms (100.0%)   81.0649ms (102.0%)   81.1859ms (102.1%)

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 5:03 PM, James Philbin [EMAIL PROTECTED] wrote: [...] If people could post the results of running the benchmark on their machines (takes ~1 min) along with the output of gcc --version and their chip model, that would be very useful. [...]

gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

  Problem size     Simple               Intrin               Inline
           100     0.0002ms (100.0%)    0.0001ms ( 68.7%)    0.0001ms ( 74.8%)
          1000     0.0015ms (100.0%)    0.0011ms ( 72.0%)    0.0012ms ( 80.4%)
         10000     0.0154ms (100.0%)    0.0111ms ( 72.1%)    0.0122ms ( 79.1%)
        100000     0.1081ms (100.0%)    0.0759ms ( 70.2%)    0.0811ms ( 75.0%)
       1000000     2.7778ms (100.0%)    2.8172ms (101.4%)    2.7929ms (100.5%)
      10000000    28.1577ms (100.0%)   28.7332ms (102.0%)   28.4669ms (101.1%)

It looks like memory access is the bottleneck; otherwise, running 4 floats through in parallel should go a lot faster. I need to modify the program a bit and see how it works for doubles. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Hi, here are my results: Intel Core 2 Duo, 2.16GHz, 667MHz bus, 4MB cache, running under OSX 10.5.2. Please note that the auto-vectorizer of gcc-4.3 is doing really well. gr~~~

gcc version 4.0.1 (Apple Inc. build 5465)
xbook-2:temp thomas$ gcc -msse -O2 vec_bench.c -o vec_bench
xbook-2:temp thomas$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0002ms (100.0%)    0.0001ms ( 83.2%)    0.0001ms ( 85.1%)
          1000     0.0014ms (100.0%)    0.0014ms ( 99.5%)    0.0014ms ( 97.6%)
         10000     0.0180ms (100.0%)    0.0137ms ( 76.1%)    0.0103ms ( 56.9%)
        100000     0.1307ms (100.0%)    0.1153ms ( 88.2%)    0.0952ms ( 72.8%)
       1000000     4.0309ms (100.0%)    4.1641ms (103.3%)    4.0129ms ( 99.6%)
      10000000    43.2557ms (100.0%)   43.5919ms (100.8%)   42.6391ms ( 98.6%)

gcc version 4.3.0 20080125 (experimental) (GCC)
xbook-2:temp thomas$ gcc-4.3 -msse -O2 vec_bench.c -o vec_bench
xbook-2:temp thomas$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0002ms (100.0%)    0.0001ms ( 77.4%)    0.0001ms ( 72.0%)
          1000     0.0017ms (100.0%)    0.0014ms ( 84.4%)    0.0014ms ( 79.4%)
         10000     0.0173ms (100.0%)    0.0148ms ( 85.4%)    0.0104ms ( 59.9%)
        100000     0.1276ms (100.0%)    0.1243ms ( 97.4%)    0.0952ms ( 74.6%)
       1000000     4.0466ms (100.0%)    4.1168ms (101.7%)    4.0348ms ( 99.7%)
      10000000    43.1842ms (100.0%)   43.2989ms (100.3%)   44.2171ms (102.4%)

xbook-2:temp thomas$ gcc-4.3 -msse -O2 -ftree-vectorize vec_bench.c -o vec_bench
xbook-2:temp thomas$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0001ms (100.0%)    0.0001ms (126.6%)    0.0001ms (120.3%)
          1000     0.0011ms (100.0%)    0.0014ms (136.3%)    0.0014ms (127.9%)
         10000     0.0144ms (100.0%)    0.0153ms (106.3%)    0.0103ms ( 72.0%)
        100000     0.1027ms (100.0%)    0.1243ms (121.0%)    0.0953ms ( 92.8%)
       1000000     3.9691ms (100.0%)    4.1197ms (103.8%)    4.0252ms (101.4%)
      10000000    42.1922ms (100.0%)   43.6721ms (103.5%)   43.4035ms (102.9%)

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 5:32 PM, Charles R Harris [EMAIL PROTECTED] wrote: [...] It looks like memory access is the bottleneck, otherwise running 4 floats through in parallel should go a lot faster. I need to modify the program a bit and see how it works for doubles. Doubles don't look so good running on a 32-bit OS. Maybe alignment would help. Compiled with gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl

  Problem size     Simple               Intrin
           100     0.0002ms (100.0%)    0.0002ms (149.5%)
          1000     0.0015ms (100.0%)    0.0024ms (159.0%)
         10000     0.0219ms (100.0%)    0.0180ms ( 81.9%)
        100000     0.1518ms (100.0%)    0.1686ms (111.1%)
       1000000     5.5588ms (100.0%)    5.8145ms (104.6%)
      10000000    56.7152ms (100.0%)   59.3139ms (104.6%)

Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 6:34 PM, Charles R Harris [EMAIL PROTECTED] wrote: I've attached a double version. Compile with gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl Chuck

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <emmintrin.h>

int sizes[6] = {100, 1000, 10000, 100000, 1000000, 10000000};
int nsizes = 6;
int repeats = 300;

double accurate_time()
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + t.tv_usec*0.000001;
}

void rand_vector(int sz, double* vec)
{
    int i;
    for (i = 0; i < sz; i++) {
        vec[i] = (double)rand()/RAND_MAX;
    }
}

typedef void (*vec_func)(double*, double*, double*, int);

double time_func(vec_func func, int sz)
{
    int i;
    double t1, t2;
    double *v1, *v2, *v3;
    v1 = (double*)malloc(sizeof(double)*sz);
    v2 = (double*)malloc(sizeof(double)*sz);
    v3 = (double*)malloc(sizeof(double)*sz);
    rand_vector(sz, v1);
    rand_vector(sz, v2);
    t1 = accurate_time();
    for (i = 0; i < repeats; i++) {
        func(v1, v2, v3, sz);
    }
    t2 = accurate_time();
    free(v1);
    free(v2);
    free(v3);
    return (t2 - t1)/repeats;
}

void multiply_double_simple(double* a, double* b, double* c, int sz)
{
    int i;
    for (i = 0; i < sz; i++) {
        c[i] = a[i]*b[i];
    }
}

void multiply_double_intrin(double* a, double* b, double* c, int sz)
{
    __m128d a_, b_, c_;
    int i;
    for (i = 0; i < (sz - 2); i += 2) {
        a_ = _mm_loadu_pd(a + i);
        b_ = _mm_loadu_pd(b + i);
        c_ = _mm_mul_pd(a_, b_);
        _mm_storeu_pd(c + i, c_);
    }
    for ( ; i < sz; i++) {
        a_ = _mm_load_sd(a + i);
        b_ = _mm_load_sd(b + i);
        c_ = _mm_mul_sd(a_, b_);
        _mm_store_sd(c + i, c_);
    }
}

/* multiply_float_inline from the float benchmark was left here,
   commented out, in the original attachment; omitted for brevity. */

vec_func methods[3] = {multiply_double_simple, multiply_double_intrin};
char *method_names[3] = {"Simple", "Intrin"};
int nmethods = 2;

void test_methods(void)
{
    int test_size = 10003; /* Not divisible by 4 */
    int i, j;
    double *v1, *v2, *v3, *v4;
    v1 = (double*)malloc(test_size*sizeof(double));
    v2 = (double*)malloc(test_size*sizeof(double));
    v3 = (double*)malloc(test_size*sizeof(double));
    v4 = (double*)malloc(test_size*sizeof(double));
    rand_vector(test_size, v1);
    rand_vector(test_size, v2);
    methods[0](v1, v2, v3, test_size);
    for (i = 1; i < nmethods; i++) {
        methods[i](v1, v2, v4, test_size);
        for (j = 0; j < test_size; j++) {
            assert(v4[j] == v3[j]);
        }
    }
    free(v1);
    free(v2);
    free(v3);
    free(v4);
}

int main(void)
{
    int i, j;
    double dt, dt_simp;
    printf("Testing methods...\n");
    test_methods();
    printf("All OK\n\n");
    printf("%20s", "Problem size");
    for (i = 0; i < nmethods; i++) {
        printf("%20s", method_names[i]);
    }
    printf("\n");
    for (i = 0; i < nsizes; i++) {
        printf("%20d", sizes[i]);
        dt_simp = time_func(methods[0], sizes[i])*1000.0;
        printf("  %7.4fms (%5.1f%%)", dt_simp, 100.0);
        for (j = 1; j < nmethods; j++) {
            dt = time_func(methods[j], sizes[i])*1000.0;
            printf("  %7.4fms (%5.1f%%)", dt, (100.0*dt)/dt_simp);
        }
        printf("\n");
    }
    return 0;
}

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Here are results under 64-bit Linux using gcc-4.3 (which by default turns on the various SSE flags). Note that -O3 is significantly better than -O2 for the simple calls:

nimrod:~$ cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
nimrod:~$ gcc-4.3 --version
gcc-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
nimrod:~$ gcc-4.3 -O2 vec_bench.c -o vec_bench
nimrod:~$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0001ms (100.0%)    0.0001ms ( 70.8%)    0.0001ms ( 74.3%)
          1000     0.0008ms (100.0%)    0.0006ms ( 70.3%)    0.0007ms ( 80.3%)
         10000     0.0085ms (100.0%)    0.0061ms ( 72.0%)    0.0067ms ( 78.8%)
        100000     0.0882ms (100.0%)    0.0627ms ( 71.1%)    0.0677ms ( 76.7%)
       1000000     3.6748ms (100.0%)    3.3312ms ( 90.7%)    3.3139ms ( 90.2%)
      10000000    37.1154ms (100.0%)   35.9762ms ( 96.9%)   36.1126ms ( 97.3%)

nimrod:~$ gcc-4.3 -O3 vec_bench.c -o vec_bench
nimrod:~$ ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0001ms (100.0%)    0.0001ms (111.1%)    0.0001ms (116.7%)
          1000     0.0005ms (100.0%)    0.0006ms (111.3%)    0.0007ms (126.8%)
         10000     0.0056ms (100.0%)    0.0061ms (108.6%)    0.0067ms (118.9%)
        100000     0.0581ms (100.0%)    0.0626ms (107.8%)    0.0677ms (116.5%)
       1000000     3.4549ms (100.0%)    3.3339ms ( 96.5%)    3.3255ms ( 96.3%)
      10000000    34.8186ms (100.0%)   35.9767ms (103.3%)   36.1099ms (103.7%)

nimrod:~$ ./vec_bench_dbl
Testing methods... All OK

  Problem size     Simple               Intrin
           100     0.0001ms (100.0%)    0.0001ms (132.5%)
          1000     0.0009ms (100.0%)    0.0012ms (134.5%)
         10000     0.0119ms (100.0%)    0.0124ms (104.1%)
        100000     0.1226ms (100.0%)    0.1276ms (104.1%)
       1000000     7.0047ms (100.0%)    6.6654ms ( 95.2%)
      10000000    70.0060ms (100.0%)   71.9692ms (102.8%)

nimrod:~$ gcc-4.3 -O3 vec_bench_dbl.c -o vec_bench_dbl
nimrod:~$ ./vec_bench_dbl
Testing methods... All OK

  Problem size     Simple               Intrin
           100     0.0001ms (100.0%)    0.0002ms (289.8%)
          1000     0.0007ms (100.0%)    0.0012ms (172.7%)
         10000     0.0114ms (100.0%)    0.0124ms (109.4%)
        100000     0.1159ms (100.0%)    0.1278ms (110.3%)
       1000000     6.9252ms (100.0%)    6.6585ms ( 96.1%)
      10000000    69.1913ms (100.0%)   71.9664ms (104.0%)

On Sat, Mar 22, 2008 at 06:41:31PM -0600, Charles R Harris wrote: [...] I've attached a double version. Compile with gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl [...]

--
Scott M. Ransom    Address: NRAO
Phone: (434) 296-0320    520 Edgemont Rd.
email: [EMAIL PROTECTED]    Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B FFCA 9BFA B6FF FFD3 2989
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Thomas Grill wrote: Hi, here are my results: Intel Core 2 Duo, 2.16GHz [...] please note that the auto-vectorizer of gcc-4.3 is doing really well [...]

gcc version 4.3.0 20080307 (Red Hat 4.3.0-2) (GCC)
gcc -msse -O2 -ftree-vectorize vec_bench.c -o vec_bench
mock-chroot ./vec_bench
Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0001ms (100.0%)    0.0001ms (141.6%)    0.0001ms (108.0%)
          1000     0.0008ms (100.0%)    0.0011ms (149.9%)    0.0008ms (100.4%)
         10000     0.0135ms (100.0%)    0.0197ms (145.8%)    0.0133ms ( 98.8%)
        100000     0.6415ms (100.0%)    0.4918ms ( 76.7%)    0.5052ms ( 78.8%)
       1000000     7.5364ms (100.0%)    7.9987ms (106.1%)    7.4832ms ( 99.3%)
      10000000    76.3927ms (100.0%)   76.8933ms (100.7%)   75.1002ms ( 98.3%)

model name : AMD Athlon(tm) 64 Processor 3200+
stepping : 10
cpu MHz : 2000.068
cache size : 1024 KB

Now the same, but with gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)

Testing methods... All OK

  Problem size     Simple               Intrin               Inline
           100     0.0002ms (100.0%)    0.0001ms ( 77.2%)    0.0001ms ( 58.7%)
          1000     0.0015ms (100.0%)    0.0011ms ( 73.5%)    0.0008ms ( 52.6%)
         10000     0.0214ms (100.0%)    0.0195ms ( 90.9%)    0.0363ms (169.3%)
        100000     0.6620ms (100.0%)    0.5614ms ( 84.8%)    0.5527ms ( 83.5%)
       1000000     7.5975ms (100.0%)    7.3826ms ( 97.2%)    7.3380ms ( 96.6%)
      10000000    75.8361ms (100.0%)   84.0476ms (110.8%)   77.2884ms (101.9%)

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
On Sat, Mar 22, 2008 at 7:35 PM, Scott Ransom [EMAIL PROTECTED] wrote: [...] Note that -O3 is significantly better than -O2 for the simple calls [...] It looks to me like the best approach here is to generate operator-specific loops for arithmetic, then check the step size in the loop for contiguous data, and if found, branch to a block where the pointers have been cast to the right type. The loop itself could even check for the operator type by switching on the function address, so that the code modifications could be localized. The compiler can do the rest. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
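A sketch of the generate-per-operator idea, under the assumption that a plain preprocessor macro stands in for numpy's actual code generator: one macro stamps out a typed loop per operator, with the contiguous branch cast to the right type as described above.

#include <stddef.h>

/* Stamp out one loop per operator; the contiguous branch gives the
   compiler a typed, unit-stride loop it can vectorize. */
#define BINARY_LOOP(name, op)                                           \
static void name(char **args, ptrdiff_t n, ptrdiff_t *steps)            \
{                                                                       \
    ptrdiff_t i;                                                        \
    if (steps[0] == sizeof(double) && steps[1] == sizeof(double) &&     \
        steps[2] == sizeof(double)) {                                   \
        const double *a = (const double *)args[0];                      \
        const double *b = (const double *)args[1];                      \
        double *c = (double *)args[2];                                  \
        for (i = 0; i < n; i++)                                         \
            c[i] = a[i] op b[i];                                        \
    } else {                                                            \
        char *pa = args[0], *pb = args[1], *pc = args[2];               \
        for (i = 0; i < n; i++,                                         \
             pa += steps[0], pb += steps[1], pc += steps[2])            \
            *(double *)pc = *(double *)pa op *(double *)pb;             \
    }                                                                   \
}

BINARY_LOOP(add_dd_d, +)
BINARY_LOOP(mul_dd_d, *)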
Re: [Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Charles R Harris wrote: It looks like memory access is the bottleneck, otherwise running 4 floats through in parallel should go a lot faster. I need to modify the program a bit and see how it works for doubles. I am not sure the benchmark is really meaningful: it does not use aligned buffers (16-byte alignment), and because of that, it does not give a good idea of what can be expected from SSE. It shows why it is not so easy to get good performance, and why just throwing in a few optimized loops won't work, though. Using sse/sse2 on unaligned buffers is a waste of time. Without this alignment, you need to take the alignment into account (using _mm_loadu_ps vs _mm_load_ps), and that's extremely slow, basically killing most of the speed increase you can expect from using SSE. Here is what I get with the above benchmark:

  Problem size     Simple               Intrin               Inline
           100     0.0002ms (100.0%)    0.0001ms ( 71.5%)    0.0001ms ( 85.0%)
          1000     0.0014ms (100.0%)    0.0010ms ( 70.6%)    0.0013ms ( 96.8%)
         10000     0.0162ms (100.0%)    0.0095ms ( 58.2%)    0.0128ms ( 78.7%)
        100000     0.4189ms (100.0%)    0.4135ms ( 98.7%)    0.4149ms ( 99.0%)
       1000000     5.9523ms (100.0%)    5.8933ms ( 99.0%)    5.8910ms ( 99.0%)
      10000000    58.9645ms (100.0%)   58.2620ms ( 98.8%)   58.7443ms ( 99.6%)

Basically, no help at all: this is on a P4, whose fpu is extremely slow if not used with optimized SSE. Now, I use posix_memalign, replace the intrinsics with their aligned-access versions, and use an accurate cycle counter (cycle.h, provided by fftw). Compiled as is:

Testing methods... All OK

  Problem size     Simple                      Intrin                      Inline
           100     4.16e+02 cycles (100.0%)    4.04e+02 cycles ( 97.1%)    4.92e+02 cycles (118.3%)
          1000     3.66e+03 cycles (100.0%)    3.11e+03 cycles ( 84.8%)    4.10e+03 cycles (111.9%)
         10000     3.47e+04 cycles (100.0%)    3.01e+04 cycles ( 86.7%)    4.06e+04 cycles (116.8%)
        100000     1.36e+06 cycles (100.0%)    1.34e+06 cycles ( 98.7%)    1.45e+06 cycles (106.7%)
       1000000     1.92e+07 cycles (100.0%)    1.87e+07 cycles ( 97.1%)    1.89e+07 cycles ( 98.2%)
      10000000     1.86e+08 cycles (100.0%)    1.80e+08 cycles ( 96.8%)    1.81e+08 cycles ( 97.4%)

Compiled with -DALIGNED, which uses the aligned-access intrinsics:

Testing methods... All OK

  Problem size     Simple                      Intrin                      Inline
           100     4.16e+02 cycles (100.0%)    1.96e+02 cycles ( 47.1%)    4.92e+02 cycles (118.3%)
          1000     3.82e+03 cycles (100.0%)    1.56e+03 cycles ( 40.8%)    4.22e+03 cycles (110.4%)
         10000     3.46e+04 cycles (100.0%)    1.92e+04 cycles ( 55.5%)    4.13e+04 cycles (119.4%)
        100000     1.32e+06 cycles (100.0%)    1.12e+06 cycles ( 85.0%)    1.16e+06 cycles ( 87.8%)
       1000000     1.95e+07 cycles (100.0%)    1.92e+07 cycles ( 98.3%)    1.95e+07 cycles (100.2%)
      10000000     1.82e+08 cycles (100.0%)    1.79e+08 cycles ( 98.4%)    1.81e+08 cycles ( 99.3%)

This makes a drastic difference (I did not touch the inline code, because it is Sunday and I am lazy). If I use this on a sane CPU (Core 2 Duo, macbook) instead of my Pentium 4, I get better results (in particular, the SSE code is never slower, and I get a 2x speed increase as long as the buffer can fit in cache). It looks like using prefetch also gives some improvement when on the edge of the cache size (my P4 has a 512 kb L2 cache):

Testing methods... All OK

  Problem size     Simple                      Intrin                      Inline
           100     4.16e+02 cycles (100.0%)    2.52e+02 cycles ( 60.6%)    4.92e+02 cycles (118.3%)
          1000     3.55e+03 cycles (100.0%)    1.85e+03 cycles ( 52.2%)    4.21e+03 cycles (118.7%)
         10000     3.48e+04 cycles (100.0%)    1.76e+04 cycles ( 50.6%)    4.13e+04 cycles (118.9%)
        100000     1.11e+06 cycles (100.0%)    7.20e+05 cycles ( 64.8%)    1.12e+06 cycles (101.3%)
       1000000     1.91e+07 cycles (100.0%)    1.98e+07 cycles (103.4%)    1.91e+07 cycles (100.0%)
      10000000     1.83e+08 cycles (100.0%)    1.90e+08 cycles (103.9%)    1.82e+08 cycles ( 99.3%)

The code can be seen there:
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/vec_bench.c
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/Makefile
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/cycle.h

Another thing that I have not seen mentioned but may be worth pursuing is using SSE in element-wise operations: you can have extremely fast exp, sin, cos and co using SSE. Those are much easier to include in numpy (but much more difficult
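A sketch of the aligned-buffer setup David describes, assuming posix_memalign (POSIX.1-2001) is available: allocate on a 16-byte boundary so the fast aligned load/store intrinsics can be used instead of the unaligned ones.

#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdlib.h>
#include <xmmintrin.h>

/* Allocate a 16-byte-aligned float buffer, or NULL on failure. */
static float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* With aligned buffers, _mm_load_ps/_mm_store_ps replace the much
   slower _mm_loadu_ps/_mm_storeu_ps. */
static void mul_aligned(const float *a, const float *b, float *c, long n)
{
    long i;
    for (i = 0; i + 4 <= n; i += 4)
        _mm_store_ps(c + i, _mm_mul_ps(_mm_load_ps(a + i),
                                       _mm_load_ps(b + i)));
    for (; i < n; i++)    /* scalar tail */
        c[i] = a[i] * b[i];
}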