Re: [Numpy-discussion] Fwd: GPU Numpy
You do realize that the throughput from onboard (video) RAM is going to be much higher, right? It's not just the parallelization but the memory bandwidth. And as James pointed out, if you can keep most of your intermediate computation on-card, you stand to benefit immensely, even if doing some operations where the GPU provides no tangible benefit (i.e. the benefit is in aggregate and avoiding copies).

Good point made here. GPUs support bandwidths on the order of 100 GB/s (bytes, not bits). Upcoming GPUs will likely break the 250 GB/s mark. Even if your expressions involve low operation/memory ratios, GPUs are a big win, as their memory bandwidth is higher than that of a CPU's L2 and even L1 caches.

Regards, -- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 09:45:29, Rohit Garg wrote: You do realize that the throughput from onboard (video) RAM is going to be much higher, right? It's not just the parallelization but the memory bandwidth. And as James pointed out, if you can keep most of your intermediate computation on-card, you stand to benefit immensely, even if doing some operations where the GPU provides no tangible benefit (i.e. the benefit is in aggregate and avoiding copies). Good point made here. GPUs support bandwidths on the order of 100 GB/s (bytes, not bits). Upcoming GPUs will likely break the 250 GB/s mark. Even if your expressions involve low operation/memory ratios, GPUs are a big win, as their memory bandwidth is higher than that of a CPU's L2 and even L1 caches.

Where are you getting this info from? IMO the technology of memory in graphics boards cannot be so different from that in commercial motherboards. It could be a *bit* faster (at the expense of packing less of it), but I'd say not as much as 4x faster (100 GB/s vs 25 GB/s for an Intel i7 in sequential access), as you are suggesting. Maybe this is GPU cache bandwidth?

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Where are you getting this info from? IMO the technology of memory in graphics boards cannot be so different from that in commercial motherboards. It could be a *bit* faster (at the expense of packing less of it), but I'd say not as much as 4x faster (100 GB/s vs 25 GB/s for an Intel i7 in sequential access), as you are suggesting. Maybe this is GPU cache bandwidth?

This is publicly documented. You can start off by looking at the Wikipedia stuff. For reference:

GTX 280 -- 141 GB/s -- has 1 GB
ATI 4870 -- 115 GB/s -- has 1 GB
ATI 5870 -- 153 GB/s (launches Sept 22, 2009) -- 2 GB models will be there too

Next-gen NVIDIA GPUs will *assuredly* have bandwidth in excess of 200 GB/s. This is *off-chip memory bandwidth* from graphics memory (aka video RAM). GPUs have (very small) caches, but they don't reduce memory latency.

-- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
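A rough back-of-envelope way to see what these bandwidth figures imply for memory-bound array math (the figures are the ones quoted above; the sketch deliberately ignores PCIe transfer costs, which is the catch discussed later in the thread):

    # Time to stream the operands of an elementwise multiply c = a * b over
    # 10**8 float64 elements, assuming the peak bandwidths quoted above.
    n = 10**8                        # number of float64 elements
    bytes_moved = 3 * 8 * n          # read a, read b, write c (8 bytes each)

    for name, gb_per_s in [("Core i7, ~25 GB/s", 25.0),
                           ("GTX 280, ~141 GB/s", 141.0)]:
        seconds = bytes_moved / (gb_per_s * 1e9)
        print("%s: %.3f s to move %.1f GB" % (name, seconds, bytes_moved / 1e9))

For a purely streaming operation, the ratio of the two bandwidths is an upper bound on the achievable speed-up, no matter how many cores the GPU has.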
Re: [Numpy-discussion] Fwd: GPU Numpy
Hi Sturla,

The proper way to speed up dot(a*b+c*sqrt(d), e) is to get rid of temporary intermediates.

I implemented a patch http://projects.scipy.org/numpy/ticket/1153 that reduces the number of temporary intermediates (in your example, from 4 to 2). There is a big improvement in terms of memory footprint, and some improvement in terms of speed (especially for large matrices), but not as much as I expected.

In your example

    result = 0
    for i in range(n):
        result += (a[i]*b[i] + c[i]*sqrt(d[i])) * e[i]

another big speedup could come from the fact that it makes better use of the cache. That is exactly why numexpr is faster in these cases. I hope one day numpy will be able to perform such optimizations.

Best, Luca ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
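For readers who have not tried it, this is roughly what the numexpr route looks like for the expression under discussion. A minimal sketch, assuming a reasonably recent numexpr with reduction support; the array size is arbitrary:

    import numpy as np
    import numexpr as ne

    n = 1000000
    a, b, c, d, e = [np.random.rand(n) for _ in range(5)]

    # Plain NumPy: builds full-size temporaries for a*b, sqrt(d), c*sqrt(d), ...
    r1 = np.dot(a*b + c*np.sqrt(d), e)

    # numexpr: the elementwise part is evaluated in cache-sized blocks with no
    # full-size temporaries, and the reduction is folded into the same pass.
    r2 = ne.evaluate("sum((a*b + c*sqrt(d)) * e)")

    print(np.allclose(r1, r2))

numexpr gets its speed from the blocking and the absence of temporaries, not from doing fewer floating point operations.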
Re: [Numpy-discussion] Fwd: GPU Numpy
Citi, Luca skrev: That is exactly why numexpr is faster in these cases. I hope one day numpy will be able to perform such optimizations.

I think it is going to require lazy evaluation. Whenever possible, an operator would just return a symbolic representation of the operation. This would gradually build up a tree of operators and buffers. When someone tries to read the data from an array, the buffer is created on demand by flushing procrastinated expressions. One must be sure that the buffers referenced in an incomplete expression never change. This would be easiest to ensure with immutable buffers.

Numexpr is the kind of back-end a system like this would require. But a lot of the code in numexpr can be omitted because Python creates the parse tree; we would not need the expression parser in numexpr as a frontend.

Well... this plan is gradually getting closer to a specialized SciPy JIT-compiler. It would be fun to make if I could find time for it.

Sturla Molden ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
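A toy illustration of the deferred-evaluation idea Sturla describes: operators only build an expression tree, and the actual array is computed when the data is read. All names here are made up for the sketch; a real implementation would hand the accumulated tree to a numexpr-style compiled loop (or a GPU kernel) rather than evaluating it node by node:

    import numpy as np

    class Lazy(object):
        # Toy deferred array: operators build a tree, data is computed on demand.
        def __init__(self, op, args):
            self.op, self.args = op, args
            self._value = None

        @classmethod
        def wrap(cls, arr):
            return cls('leaf', (np.asarray(arr),))

        def __add__(self, other):
            return Lazy('add', (self, other))

        def __mul__(self, other):
            return Lazy('mul', (self, other))

        def force(self):
            # Flush the procrastinated expression and return a real ndarray.
            if self._value is None:
                if self.op == 'leaf':
                    self._value = self.args[0]
                else:
                    x, y = [arg.force() for arg in self.args]
                    self._value = x + y if self.op == 'add' else x * y
            return self._value

    a = Lazy.wrap(np.arange(5.0))
    b = Lazy.wrap(np.ones(5))
    expr = a * b + a        # nothing is computed yet
    print(expr.force())     # evaluation happens here

The immutability question raised in the following messages is about what happens if the underlying buffers are modified between building expr and calling force().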
Re: [Numpy-discussion] Fwd: GPU Numpy
Rohit Garg skrev:

GTX 280 -- 141 GB/s -- has 1 GB
ATI 4870 -- 115 GB/s -- has 1 GB
ATI 5870 -- 153 GB/s (launches Sept 22, 2009) -- 2 GB models will be there too

That is going to help if buffers are kept in graphics memory. But the problem is that graphics memory is a scarce resource.

S.M. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 11:11:22, Sturla Molden wrote: Citi, Luca skrev: That is exactly why numexpr is faster in these cases. I hope one day numpy will be able to perform such optimizations. I think it is going to require lazy evaluation. Whenever possible, an operator would just return a symbolic representation of the operation. This would gradually build up a tree of operators and buffers. When someone tries to read the data from an array, the buffer is created on demand by flushing procrastinated expressions. One must be sure that the buffers referenced in an incomplete expression never change. This would be easiest to ensure with immutable buffers. Numexpr is the kind of back-end a system like this would require. But a lot of the code in numexpr can be omitted because Python creates the parse tree; we would not need the expression parser in numexpr as a frontend. Well... this plan is gradually getting closer to a specialized SciPy JIT-compiler. It would be fun to make if I could find time for it.

Numexpr already uses the Python parser, instead of building a new one. However, the bytecode emitted after the compilation process is different, of course.

Also, I don't see the point in requiring immutable buffers. Could you develop this further?

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Sep 10, 2009 at 10:36:27AM +0200, Francesc Alted wrote: Where are you getting this info from? IMO the technology of memory in graphics boards cannot be so different from that in commercial motherboards. It could be a *bit* faster (at the expense of packing less of it), but I'd say not as much as 4x faster (100 GB/s vs 25 GB/s for an Intel i7 in sequential access), as you are suggesting. Maybe this is GPU cache bandwidth?

I believe this is simply because the transfers are made in parallel to the different processing units of the graphics card. So we are back to the importance of embarrassingly parallel problems and of specifying things with high-level operations rather than for loops.

Gaël ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 10:58:13, Rohit Garg wrote: Where are you getting this info from? IMO the technology of memory in graphics boards cannot be so different from that in commercial motherboards. It could be a *bit* faster (at the expense of packing less of it), but I'd say not as much as 4x faster (100 GB/s vs 25 GB/s for an Intel i7 in sequential access), as you are suggesting. Maybe this is GPU cache bandwidth? This is publicly documented. You can start off by looking at the Wikipedia stuff. For reference: GTX 280 -- 141 GB/s -- has 1 GB; ATI 4870 -- 115 GB/s -- has 1 GB; ATI 5870 -- 153 GB/s (launches Sept 22, 2009) -- 2 GB models will be there too. Next-gen NVIDIA GPUs will *assuredly* have bandwidth in excess of 200 GB/s. This is *off-chip memory bandwidth* from graphics memory (aka video RAM). GPUs have (very small) caches, but they don't reduce memory latency.

That's nice to see. I think I'll change my mind if someone could perform a vector-vector multiplication (an operation that is typically memory-bound) in double precision up to 5x faster on a GTX 280 NVIDIA card than on an Intel i7 CPU.

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 11:20:21, Gael Varoquaux wrote: On Thu, Sep 10, 2009 at 10:36:27AM +0200, Francesc Alted wrote: Where are you getting this info from? IMO the technology of memory in graphics boards cannot be so different from that in commercial motherboards. It could be a *bit* faster (at the expense of packing less of it), but I'd say not as much as 4x faster (100 GB/s vs 25 GB/s for an Intel i7 in sequential access), as you are suggesting. Maybe this is GPU cache bandwidth? I believe this is simply because the transfers are made in parallel to the different processing units of the graphics card. So we are back to the importance of embarrassingly parallel problems and of specifying things with high-level operations rather than for loops.

Sure. Especially because NumPy is all about embarrassingly parallel problems (after all, this is how a ufunc works, doing operations element-by-element).

The point is: are GPUs prepared to compete with general-purpose CPUs in all-road operations, like evaluating transcendental functions, conditionals, all of this with a rich set of data types? I would like to believe that this is the case, but I don't think so (at least not yet).

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote: The point is: are GPUs prepared to compete with general-purpose CPUs in all-road operations, like evaluating transcendental functions, conditionals, all of this with a rich set of data types? I would like to believe that this is the case, but I don't think so (at least not yet).

I believe (this is very foggy) that GPUs can implement non-trivial logic on their base processing units, so that conditionals and transcendental functions are indeed possible. Where it gets hard is when you don't have problems that can be expressed in an embarrassingly parallel manner. There are solutions there too (I believe of the message-passing type); after all, matrix multiplication is done on GPUs.

Gaël ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Sure. Especially because NumPy is all about embarrassingly parallel problems (after all, this is how a ufunc works, doing operations element-by-element). The point is: are GPUs prepared to compete with general-purpose CPUs in all-road operations, like evaluating transcendental functions, conditionals, all of this with a rich set of data types? I would like to believe that this is the case, but I don't think so (at least not yet).

A lot of nVidia's SDK functions are not done on the GPU. There are some functions that they provide where the actual computation is done on the CPU, not on the GPU (I don't have an example here, but nVidia's forum is full of examples ;))

Matthieu -- Information System Engineer, Ph.D. Website: http://matthieu-brucher.developpez.com/ Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92 LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
The point is: are GPUs prepared to compete with general-purpose CPUs in all-road operations, like evaluating transcendental functions, conditionals, all of this with a rich set of data types?

Yup.

-- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 11:40:48, Sturla Molden wrote: Francesc Alted skrev: Numexpr already uses the Python parser, instead of building a new one. However, the bytecode emitted after the compilation process is different, of course. Also, I don't see the point in requiring immutable buffers. Could you develop this further? If you do lazy evaluation, a function like this could fail without immutable buffers:

    def foobar(x):
        y = a*x[:] + b
        x[0] = 0   # affects y and anything else depending on x
        return y

Immutable buffers are not required; one could document the oddity, but coding would be very error-prone.

Mmh, I don't see a problem here if the order of operations is kept untouched (and you normally want to do this). But I'm not an expert on 'lazy evaluation', so you may want to just ignore my comments ;-)

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 11:37:24, Gael Varoquaux wrote: On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote: The point is: are GPUs prepared to compete with general-purpose CPUs in all-road operations, like evaluating transcendental functions, conditionals, all of this with a rich set of data types? I would like to believe that this is the case, but I don't think so (at least not yet). I believe (this is very foggy) that GPUs can implement non-trivial logic on their base processing units, so that conditionals and transcendental functions are indeed possible. Where it gets hard is when you don't have problems that can be expressed in an embarrassingly parallel manner.

But NumPy is about embarrassingly parallel calculations, right? I mean:

    a = np.cos(b)

where b is a 1x1 matrix is *very* embarrassing (in the parallel meaning of the term ;-)

Can anyone here say how the above operation can be done with GPUs? (and providing some timings would be really great :)

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
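One concrete answer to the question: PyCUDA's GPUArray already exposes exactly this kind of elementwise operation. A minimal sketch, assuming PyCUDA and a CUDA-capable card are available (single precision here, since that is what most current cards handle well); note that a fair timing has to decide whether the host/device copies are counted:

    import numpy as np
    import pycuda.autoinit              # sets up a CUDA context
    import pycuda.gpuarray as gpuarray
    import pycuda.cumath as cumath

    b = np.random.rand(10**7).astype(np.float32)

    b_gpu = gpuarray.to_gpu(b)          # host -> device copy
    a_gpu = cumath.cos(b_gpu)           # elementwise cos, runs on the card
    a = a_gpu.get()                     # device -> host copy

    print(np.allclose(a, np.cos(b), atol=1e-5))

For a single cheap elementwise operation like this, the two copies usually dominate, which is why later messages stress keeping intermediates on the card.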
Re: [Numpy-discussion] Fwd: GPU Numpy
a = np.cos(b) where b is a 1x1 matrix is *very* embarrassing (in the parallel meaning of the term ;-)

On this operation, GPUs will eat up CPUs like a pack of piranhas. :)

-- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
That's nice to see. I think I'll change my mind if someone could perform a vector-vector multiplication (an operation that is typically memory-bound)

You mean a dot product?

-- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 14:36:16, Rohit Garg wrote: That's nice to see. I think I'll change my mind if someone could perform a vector-vector multiplication (an operation that is typically memory-bound) You mean a dot product?

Whatever, dot product or element-wise product. Both are memory-bound.

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On 09/10/2009 07:40 AM, Francesc Alted wrote: On Thursday 10 September 2009 14:36:16, Rohit Garg wrote: That's nice to see. I think I'll change my mind if someone could perform a vector-vector multiplication (an operation that is typically memory-bound) You mean a dot product? Whatever, dot product or element-wise product. Both are memory-bound. -- Francesc Alted

As Francesc previously said, these need to be at least in double precision, and really it should also be in all the floating point precisions used by numpy on supported platforms. Based on the various BOINC project comments, many graphics cards do not natively support double precision, so you can get an inflated speedup just because of the difference in precision.

Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Apart from float and double, which floating point formats are supported by numpy?

On Thu, Sep 10, 2009 at 7:09 PM, Bruce Southey bsout...@gmail.com wrote: On 09/10/2009 07:40 AM, Francesc Alted wrote: On Thursday 10 September 2009 14:36:16, Rohit Garg wrote: That's nice to see. I think I'll change my mind if someone could perform a vector-vector multiplication (an operation that is typically memory-bound) You mean a dot product? Whatever, dot product or element-wise product. Both are memory-bound. -- Francesc Alted As Francesc previously said, these need to be at least in double precision, and really it should also be in all the floating point precisions used by numpy on supported platforms. Based on the various BOINC project comments, many graphics cards do not natively support double precision, so you can get an inflated speedup just because of the difference in precision. Bruce

-- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thursday 10 September 2009 15:51:15, Rohit Garg wrote: Apart from float and double, which floating point formats are supported by numpy?

I think whatever is supported by the underlying CPU, whether it is extended double precision (12 bytes) or quad precision (16 bytes).

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
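A quick way to see what a given NumPy build actually offers is to inspect the float scalar types directly; a small sketch (the exact meaning of np.longdouble is platform-dependent):

    import numpy as np

    for t in (np.float32, np.float64, np.longdouble):
        info = np.finfo(t)
        print("%s: %d bits, eps = %s" % (t.__name__, info.bits, info.eps))

    # On x86/x86-64 builds np.longdouble is usually the 80-bit x87 extended
    # type (stored padded to 12 or 16 bytes); on some platforms it is simply
    # an alias for 64-bit double, and true quad precision is rare.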
Re: [Numpy-discussion] Fwd: GPU Numpy
I think whatever is supported by the underlying CPU, whether it is extended double precision (12 bytes) or quad precision (16 bytes).

Classic 64-bit CPUs support neither.

-- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Sep 10, 2009 at 07:28, Francesc Alted fal...@pytables.org wrote: On Thursday 10 September 2009 11:37:24, Gael Varoquaux wrote: On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote: The point is: are GPUs prepared to compete with general-purpose CPUs in all-road operations, like evaluating transcendental functions, conditionals, all of this with a rich set of data types? I would like to believe that this is the case, but I don't think so (at least not yet). I believe (this is very foggy) that GPUs can implement non-trivial logic on their base processing units, so that conditionals and transcendental functions are indeed possible. Where it gets hard is when you don't have problems that can be expressed in an embarrassingly parallel manner. But NumPy is about embarrassingly parallel calculations, right? I mean: a = np.cos(b) where b is a 1x1 matrix is *very* embarrassing (in the parallel meaning of the term ;-)

Yes. However, it is worth making the distinction between embarrassingly parallel problems and SIMD problems. Not all embarrassingly parallel problems are SIMD-capable. GPUs do SIMD, not embarrassingly parallel problems in general. If there are branches, as would be necessary for many special functions, the GPU does not perform as well. Basically, every unit has to do both branches, because they all must do the same instruction at the same time, even though the data on each unit only gets processed by one branch.

cos() is easy. Or at least it is so necessary to graphics computing that it is already a primitive in all (most?) GPU languages. Googling around shows SIMD code for the basic transcendental functions. I believe you have to code them differently than you would on a CPU. Other special functions would simply be hard to do efficiently.

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
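Robert's point about both branches being executed can be seen in plain NumPy as well: the data-parallel way to write a conditional evaluates both sides over the whole array and then selects, which is essentially what a SIMD unit (or a divergent GPU warp) ends up paying for. A small sketch:

    import numpy as np

    x = np.linspace(-5.0, 5.0, 11)

    # Scalar code would branch per element:
    #     y = exp(x)   if x < 0
    #     y = 1 + x    otherwise
    # The vectorized formulation computes BOTH expressions for every element
    # and then selects, so the cost is roughly the sum of the two branches:
    y = np.where(x < 0, np.exp(x), 1.0 + x)

Special functions that choose between several expansions depending on the argument multiply this effect, which is why they are harder to make fast on SIMD hardware.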
Re: [Numpy-discussion] Fwd: GPU Numpy
Yes. However, it is worth making the distinction between embarrassingly parallel problems and SIMD problems. Not all embarrassingly parallel problems are SIMD-capable. GPUs do SIMD, not embarrassingly parallel problems in general.

GPUs exploit both dimensions of parallelism, both SIMD (aka vectorization) and parallelization (aka multicore). And yeah, 99.9% of the time branching on a GPU should be the least/last of your worries if your problem is data-parallel. There are much worse things than branching.

As for SIMD special functions, branching can certainly be eliminated. I have written/come across some special functions myself, and I do not know of any case which is difficult to do efficiently on a GPU. Certainly, I know less than some folks around here. Maybe you can contribute a counterexample to this discussion.

Regards, -- Rohit Garg http://rpg-314.blogspot.com/ Senior Undergraduate Department of Physics Indian Institute of Technology Bombay ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Tuesday 08 September 2009 21:19:05, George Dahl wrote: Sturla Molden sturla at molden.no writes: Erik Tollerud skrev: NumPy arrays on the GPU memory is an easy task. But then I would have to write the computation in OpenCL's dialect of C99? This is true to some extent, but also probably difficult to do given the fact that parallelizable algorithms are generally more difficult to formulate in straightforward ways. Then you have misunderstood me completely. Creating an ndarray that has a buffer in graphics memory is not too difficult, given that graphics memory can be memory mapped. This has nothing to do with parallelizable algorithms or not. It is just memory management. We could make an ndarray subclass that quickly puts its content in a buffer accessible to the GPU. That is not difficult. But then comes the question of what you do with it. I think many here misunderstand the issue: Teraflops peak performance of modern GPUs is impressive. But NumPy cannot easily benefit from that. In fact, there is little or nothing to gain from optimising in that end. In order for a GPU to help, computation must be the time-limiting factor. It is not. There is not more to say about using GPUs in NumPy right now. Take a look at the timings here: http://www.scipy.org/PerformancePython It shows that computing with NumPy is more than ten times slower than using plain C. This is despite NumPy being written in C. The NumPy code does not incur 10 times more floating point operations than the C code. The floating point unit does not run in turtle mode when using NumPy. NumPy's relative slowness compared to C has nothing to do with floating point computation. It is due to inferior memory use (temporary buffers, multiple buffer traversals) and memory access being slow. Moving computation to the GPU can only make this worse. Improved memory usage - e.g. through lazy evaluation and JIT compilation of expressions - can give up to a tenfold increase in performance. That is where we must start optimising to get a faster NumPy. Incidentally, this will also make it easier to leverage modern GPUs. Sturla Molden

I know that for my work, I can get around an order of a 50-fold speedup over numpy using a python wrapper for a simple GPU matrix class. So I might be dealing with a lot of matrix products where I multiply a fixed 512 by 784 matrix by a 784 by 256 matrix that changes between each matrix product, although to really see the largest gains I use a 4096 by 2048 matrix times a bunch of 2048 by 256 matrices. If all I was doing were those matrix products, it would be even faster, but what I actually am doing is a matrix product, then adding a column vector to the result, then applying an elementwise logistic sigmoid function and potentially generating a matrix of pseudorandom numbers the same shape as my result (although not always). When I do these sorts of workloads, my python numpy+GPU matrix class goes so much faster than anything that doesn't use the GPU (be it Matlab, or numpy, or C/C++ whatever) that I don't even bother measuring the speedups precisely. In some cases, my python code isn't making too many temporaries since what it is doing is so simple, but in other cases that is obviously slowing it down a bit. I have relatively complicated jobs that used to take weeks on the CPU and can now take hours or days. Obviously improved memory usage would be more helpful since not everyone has access to the sorts of GPUs I use, but tenfold increases in performance seem like chump change compared to what I see with the sorts of workloads I do.

50-fold increases over NumPy+[Atlas|MKL] are really impressive. However, the point is that these speed-ups can be achieved only when the ratio of operations per element is really huge. Matrix-matrix multiplication (your example above) is a paradigmatic example of these scenarios, where computations are O(n**3) (or a little less when optimized algorithms are used), while memory access is O(n**2). Of course, when the matrices are large, the ratio of operations to elements is larger, allowing much better speed-ups; this is why GPUs really do a good job here.

The point here is that matrix-matrix multiplications (or, in general, functions with a large operation/element ratio) are a *tiny* part of all the possible operations between arrays that NumPy supports. This is why Sturla is saying that it is not a good idea to include support of GPUs in all parts of NumPy. A much better strategy is to give NumPy the possibility to link with external packages (à la BLAS, LAPACK, Atlas, MKL) that can leverage the powerful GPUs for specific problems (e.g. matrix-matrix multiplications).

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
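The operations-per-byte argument can be made concrete with a quick calculation (double precision assumed; the matrix size is arbitrary):

    # Arithmetic intensity (flops per byte of memory traffic), double precision.
    n = 2048

    matmul_flops = 2.0 * n**3              # n**3 multiply-add pairs
    matmul_bytes = 3.0 * 8 * n**2          # read A, read B, write C
    print("n x n matmul: %.0f flops/byte" % (matmul_flops / matmul_bytes))      # ~171

    elemwise_flops = 1.0 * n**2            # one multiply per element
    elemwise_bytes = 3.0 * 8 * n**2
    print("elementwise:  %.3f flops/byte" % (elemwise_flops / elemwise_bytes))  # ~0.042

A ratio of hundreds of flops per byte lets a GPU's floating point units stay busy; a ratio far below one means both the CPU and the GPU are simply waiting on memory.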
Re: [Numpy-discussion] Fwd: GPU Numpy
On Tuesday 08 September 2009 23:21:53, Christopher Barker wrote: Also, perhaps a GPU-aware numexpr could be helpful, which I think is the kind of thing that Sturla was referring to when she wrote: Incidentally, this will also make it easier to leverage modern GPUs.

Numexpr mainly supports functions that are meant to be used element-wise, so the operation/element ratio is normally 1 (or close to 1). These are the scenarios where improved memory access is much more important than CPU (or, for that matter, GPU), and this is the reason why numexpr is much more efficient than NumPy when evaluating complex expressions like ``a*b+c*sqrt(d)``. In other words, a GPU-enabled numexpr makes little sense.

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Received from Francesc Alted on Wed, Sep 09, 2009 at 05:18:48AM EDT: (snip) The point here is that matrix-matrix multiplications (or, in general, functions with a large operation/element ratio) are a *tiny* part of all the possible operations between arrays that NumPy supports. This is why Sturla is saying that it is not a good idea to include support of GPUs in all parts of NumPy. A much better strategy is to give NumPy the possibility to link with external packages (à la BLAS, LAPACK, Atlas, MKL) that can leverage the powerful GPUs for specific problems (e.g. matrix-matrix multiplications).

.. and CULA: http://www.culatools.com/

L.G. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Wednesday 09 September 2009 11:26:06, Francesc Alted wrote: On Tuesday 08 September 2009 23:21:53, Christopher Barker wrote: Also, perhaps a GPU-aware numexpr could be helpful, which I think is the kind of thing that Sturla was referring to when she wrote: Incidentally, this will also make it easier to leverage modern GPUs. Numexpr mainly supports functions that are meant to be used element-wise, so the operation/element ratio is normally 1 (or close to 1). These are the scenarios where improved memory access is much more important than CPU (or, for that matter, GPU), and this is the reason why numexpr is much more efficient than NumPy when evaluating complex expressions like ``a*b+c*sqrt(d)``. In other words, a GPU-enabled numexpr makes little sense.

Er, I forgot the fact that one exception to the operation/element ratio normally being 1 in numexpr is the computation of transcendental functions (trigonometric, exponential, logarithmic...), where the number of CPU operations per element is much larger than 1 (normally in the 100s). Right now, there is support for accelerating them in numexpr via VML (Intel's Vector Math Library), but I suppose that a library making use of a GPU would be very interesting too (and the same applies to numpy). But again, it makes more sense to rely on external packages or libraries (similar to the VML above) for this sort of thing.

After having a look at CULA (thanks for the pointer, Lev!), my hope is that in short we will see other libraries allowing for efficient evaluation of transcendental functions using GPUs too.

-- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Wed, Sep 9, 2009 at 10:41 AM, Francesc Alted fal...@pytables.org wrote: Numexpr mainly supports functions that are meant to be used element-wise, so the operation/element ratio is normally 1 (or close to 1). In these scenarios is where improved memory access is much more important than CPU (or, for that matter, GPU), and is the reason why numexpr is much more efficient than NumPy when evaluating complex expressions like ``a*b+c*sqrt(d)``. In other words, a GPU-enabled numexpr makes little sense. There's another way of looking at this, which has been mentioned before in the conversation, but which I think should be mentioned again... The cost of transfer to and from a GPU is very high, compared with most of the sorts of things that we do with ndarrays. So the approach of using libraries to speed up little pieces here and there (i.e. with VML or ATLAS) but basically to let stock numpy take care of the rest does not work. In order to benefit from huge speedups on a GPU, data need to be on the GPU already. It is a good idea to perform low-instruction density functions on the GPU even when the CPU could go just as fast (or even if the CPU is faster!) just to ensure that the data stay on the GPU. Suppose you want to evaluate dot(a*b+c*sqrt(d), e). The GPU is great for doing dot(), but if you have to copy the result of the elemwise expression to the GPU before you can start doing dot(), then the performance advantage is ruined. Except for huge matrices, you might as well just leave the data in the system RAM and use a normal BLAS library. So that's why it is a good idea to use the GPU to do some functions even when the CPU would be faster for them (in isolation). All that said, there is a possibility that future devices (and some laptops already?) will use an integrated memory system that might make 'copying to the GPU' a non-issue... but we're not there yet I think... James -- http://www-etud.iro.umontreal.ca/~bergstrj ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
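A sketch of the on-card strategy James describes, using PyCUDA's GPUArray (assuming a PyCUDA recent enough to provide gpuarray.dot; the array sizes are arbitrary). Everything after the initial transfer stays in video RAM, and only the final scalar comes back:

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import pycuda.cumath as cumath

    n = 10**6
    host = [np.random.rand(n).astype(np.float32) for _ in range(5)]
    a, b, c, d, e = [gpuarray.to_gpu(x) for x in host]   # one-time transfer

    # The elementwise part is done on the card, even though the CPU could do
    # it about as fast, purely to avoid a round trip through system RAM...
    tmp = a * b + c * cumath.sqrt(d)

    # ...and so is the reduction; only a single number is copied back.
    result = float(gpuarray.dot(tmp, e).get())

The elementwise line still creates temporaries (tmp is one), just on the card; the thread's earlier points about fusing expressions apply on the GPU exactly as they do on the CPU.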
Re: [Numpy-discussion] Fwd: GPU Numpy
Christopher Barker wrote: George Dahl wrote: Sturla Molden sturla at molden.no writes: Teraflops peak performance of modern GPUs is impressive. But NumPy cannot easily benefit from that. I know that for my work, I can get around an order of a 50-fold speedup over numpy using a python wrapper for a simple GPU matrix class. I think you're talking across each other here. Sturla is referring to making a numpy ndarray gpu-aware and then expecting expressions like: z = a*x**2 + b*x + c to go faster when a, b, c, and x are ndarrays. That's not going to happen. On the other hand, George is talking about moving higher-level operations (like a matrix product) over to GPU code. This is analogous to numpy.linalg and numpy.dot() using LAPACK routines, and yes, that could help those programs that use such operations. So a GPU LAPACK would be nice. This is also analogous to using SWIG, or ctypes or cython or weave, or ??? to move a computationally expensive part of the code over to C. I think anything that makes it easier to write little bits of your code for the GPU would be pretty cool -- a GPU-aware Cython?

Cython is probably open for that if anybody's interested in implementing it / making a student project on it (way too big for GSoC, I think, unfortunately). However, I'd definitely make it a generic library turning expressions into compiled code (either GPU or CPU w/ SSE); that could then be used both at compile-time from Cython, and at run-time using e.g. SymPy or SAGE expressions. Both PyCUDA and CorePy would tend to allow both compile-time operation and run-time operation.

-- Dag Sverre ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
George Dahl skrev: I know that for my work, I can get around an order of a 50-fold speedup over numpy using a python wrapper for a simple GPU matrix class. So I might be dealing with a lot of matrix products where I multiply a fixed 512 by 784 matrix by a 784 by 256 matrix that changes between each matrix product, although to really see the largest gains I use a 4096 by 2048 matrix times a bunch of 2048 by 256 matrices.

Matrix multiplication is at the core of 3D graphics, and the raison d'être for GPUs. That is specifically what they are designed to do. Matrix multiplication scales as O(n**3) in floating point operations and O(n**2) in memory access. That is, GPUs give fast 3D graphics (matrix multiplications) by speeding up floating point operations.

GPUs make sense for certain level-3 BLAS calls, but that really belongs in BLAS, not in NumPy's core. One could e.g. consider linking with a BLAS wrapper that directs these special cases to the GPU and the rest to ATLAS / MKL / netlib BLAS.

Sturla Molden ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
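The dispatching idea can be sketched at the Python level as well. Everything here is hypothetical: the threshold, the gpu_dot argument, and the assumption that some GPU GEMM wrapper (e.g. a CUBLAS binding) is available to pass in:

    import numpy as np

    GPU_THRESHOLD = 512          # hypothetical crossover size; tune per machine

    def smart_dot(a, b, gpu_dot=None):
        # Send large 2-D matrix products to a GPU BLAS, everything else to
        # whatever BLAS NumPy was linked against (ATLAS / MKL / netlib).
        if (gpu_dot is not None and a.ndim == 2 and b.ndim == 2
                and min(a.shape + b.shape) >= GPU_THRESHOLD):
            return gpu_dot(a, b)
        return np.dot(a, b)

In practice the same dispatch would live below NumPy, inside the BLAS wrapper Sturla mentions, so user code would keep calling np.dot unchanged.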
Re: [Numpy-discussion] Fwd: GPU Numpy
James Bergstra skrev: Suppose you want to evaluate dot(a*b+c*sqrt(d), e). The GPU is great for doing dot(),

The CPU is equally great (or better?) for doing dot(). In both cases:

- memory access scales O(n) for dot products.
- computation scales O(n) for dot products.
- memory is slow
- computation is fast (faster for the GPU)

In both cases, the floating point unit is starved. That means it could do a lot more work if memory were faster. For the GPU to be faster than the CPU, you have to have a situation where computation dominates over memory access. Matrix-matrix multiplication is one such example. This is what GPUs are designed to do, as it is the major bottleneck in 3D graphics.

The proper way to speed up dot(a*b+c*sqrt(d), e) is to get rid of temporary intermediates. That is, in Python pseudo-code:

    result = 0
    for i in range(n):
        result += (a[i]*b[i] + c[i]*sqrt(d[i])) * e[i]

instead of:

    tmp0 = empty(n)
    for i in range(n):
        tmp0[i] = a[i] * b[i]
    tmp1 = empty(n)
    for i in range(n):
        tmp1[i] = sqrt(d[i])
    tmp2 = empty(n)
    for i in range(n):
        tmp2[i] = c[i] * tmp1[i]
    tmp3 = empty(n)
    for i in range(n):
        tmp3[i] = tmp0[i] + tmp2[i]
    result = 0
    for i in range(n):
        result += tmp3[i] * e[i]

It is this complication that makes NumPy an order of magnitude slower than hand-crafted C (but still much faster than pure Python!). Adding in GPUs will not change this. The amount of computation (flop count) is the same, so it cannot be the source of the slowness.

Sturla Molden ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On 10-Sep-09, at 12:47 AM, Sturla Molden wrote: The CPU is equally great (or better?) for doing dot(). In both cases: - memory access scales O(n) for dot products. - computation scales O(n) for dot products. - memory is slow - computation is fast (faster for the GPU)

You do realize that the throughput from onboard (video) RAM is going to be much higher, right? It's not just the parallelization but the memory bandwidth. And as James pointed out, if you can keep most of your intermediate computation on-card, you stand to benefit immensely, even if doing some operations where the GPU provides no tangible benefit (i.e. the benefit is in aggregate and avoiding copies).

FWIW I agree with you that NumPy isn't the place for GPU stuff to happen. In the short to medium term we need a way to make it simpler for naturally expressed computations not to go hog wild with temporary allocations (it's a very hard problem given the constraints of the language). In the long term I envision something with flexible enough machinery to be manipulating objects in GPU memory with the same ease as in main memory, but I think the path to that lies in increasing the generality and flexibility of the interfaces exposed.

David ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Wed, Sep 9, 2009 at 9:47 PM, Sturla Molden stu...@molden.no wrote: James Bergstra skrev: Suppose you want to evaluate dot(a*b+c*sqrt(d), e). The GPU is great for doing dot(), The CPU is equally great (or better?) for doing dot(). In both cases: - memory access scales O(n) for dot products. - computation scales O(n) for dot products.

Remember that we have a little terminology ambiguity here: in numpy, dot(a,b) is used to describe both the vector dot product, an O(n) operation if a and b are n-element vectors, and the matrix product, an O(n**3) operation if a and b are both nxn square matrices. Just a clarification...

Cheers, f ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
George Dahl wrote: Sturla Molden sturla at molden.no writes: Teraflops peak performance of modern GPUs is impressive. But NumPy cannot easily benefit from that. I know that for my work, I can get around an order of a 50-fold speedup over numpy using a python wrapper for a simple GPU matrix class.

I think you're talking across each other here. Sturla is referring to making a numpy ndarray gpu-aware and then expecting expressions like:

    z = a*x**2 + b*x + c

to go faster when a, b, c, and x are ndarrays. That's not going to happen.

On the other hand, George is talking about moving higher-level operations (like a matrix product) over to GPU code. This is analogous to numpy.linalg and numpy.dot() using LAPACK routines, and yes, that could help those programs that use such operations. So a GPU LAPACK would be nice.

This is also analogous to using SWIG, or ctypes or cython or weave, or ??? to move a computationally expensive part of the code over to C. I think anything that makes it easier to write little bits of your code for the GPU would be pretty cool -- a GPU-aware Cython?

Also, perhaps a GPU-aware numexpr could be helpful, which I think is the kind of thing that Sturla was referring to when she wrote: Incidentally, this will also make it easier to leverage modern GPUs.

-Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Sturla Molden sturla at molden.no writes: Erik Tollerud skrev: NumPy arrays on the GPU memory is an easy task. But then I would have to write the computation in OpenCL's dialect of C99? This is true to some extent, but also probably difficult to do given the fact that parallelizable algorithms are generally more difficult to formulate in straightforward ways. Then you have misunderstood me completely. Creating an ndarray that has a buffer in graphics memory is not too difficult, given that graphics memory can be memory mapped. This has nothing to do with parallelizable algorithms or not. It is just memory management. We could make an ndarray subclass that quickly puts its content in a buffer accessible to the GPU. That is not difficult. But then comes the question of what you do with it. I think many here misunderstand the issue: Teraflops peak performance of modern GPUs is impressive. But NumPy cannot easily benefit from that. In fact, there is little or nothing to gain from optimising in that end. In order for a GPU to help, computation must be the time-limiting factor. It is not. There is not more to say about using GPUs in NumPy right now. Take a look at the timings here: http://www.scipy.org/PerformancePython It shows that computing with NumPy is more than ten times slower than using plain C. This is despite NumPy being written in C. The NumPy code does not incur 10 times more floating point operations than the C code. The floating point unit does not run in turtle mode when using NumPy. NumPy's relative slowness compared to C has nothing to do with floating point computation. It is due to inferior memory use (temporary buffers, multiple buffer traversals) and memory access being slow. Moving computation to the GPU can only make this worse. Improved memory usage - e.g. through lazy evaluation and JIT compilation of expressions - can give up to a tenfold increase in performance. That is where we must start optimising to get a faster NumPy. Incidentally, this will also make it easier to leverage modern GPUs. Sturla Molden

I know that for my work, I can get around an order of a 50-fold speedup over numpy using a python wrapper for a simple GPU matrix class. So I might be dealing with a lot of matrix products where I multiply a fixed 512 by 784 matrix by a 784 by 256 matrix that changes between each matrix product, although to really see the largest gains I use a 4096 by 2048 matrix times a bunch of 2048 by 256 matrices. If all I was doing were those matrix products, it would be even faster, but what I actually am doing is a matrix product, then adding a column vector to the result, then applying an elementwise logistic sigmoid function and potentially generating a matrix of pseudorandom numbers the same shape as my result (although not always). When I do these sorts of workloads, my python numpy+GPU matrix class goes so much faster than anything that doesn't use the GPU (be it Matlab, or numpy, or C/C++ whatever) that I don't even bother measuring the speedups precisely. In some cases, my python code isn't making too many temporaries since what it is doing is so simple, but in other cases that is obviously slowing it down a bit. I have relatively complicated jobs that used to take weeks on the CPU and can now take hours or days.

Obviously improved memory usage would be more helpful since not everyone has access to the sorts of GPUs I use, but tenfold increases in performance seem like chump change compared to what I see with the sorts of workloads I do.
___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
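For concreteness, the CPU-side version of the workload George describes above looks roughly like the following (the shapes come from his message; the names and everything else are just an illustrative sketch, and his GPU matrix class presumably exposes the same operations on on-card data):

    import numpy as np

    W = np.random.randn(512, 784).astype(np.float32)   # fixed weight matrix
    X = np.random.randn(784, 256).astype(np.float32)   # changes on every call
    bias = np.random.randn(512, 1).astype(np.float32)  # column vector

    # matrix product, add a column vector, elementwise logistic sigmoid,
    # and (sometimes) a same-shaped matrix of pseudorandom numbers:
    H = 1.0 / (1.0 + np.exp(-(np.dot(W, X) + bias)))
    noise = np.random.rand(*H.shape).astype(np.float32)

The matrix product supplies the high operation/element ratio discussed later in the thread; the bias add, sigmoid, and random matrix are cheap, but on a GPU they are worth doing on the card anyway so the data never leaves video RAM.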
Re: [Numpy-discussion] Fwd: GPU Numpy
Hi everyone,

In case anyone is interested, I just set up a Google group to discuss GPU-based simulation for our Python neural simulator Brian: http://groups.google.fr/group/brian-on-gpu Our simulator relies heavily on NumPy. I would be very happy if the GPU experts here would like to share their expertise.

Best, Romain

Romain Brette wrote: Sturla Molden wrote: Thus, here is my plan: 1. a special context-manager class 2. immutable arrays inside the with statement 3. lazy evaluation: expressions build up a parse tree 4. dynamic code generation 5. evaluation on exit There seems to be some similarity with what we want to do to accelerate our neural simulations (briansimulator.org), as described here: http://brian.svn.sourceforge.net/viewvc/brian/trunk/dev/BEPs/BEP-9-Automatic%20code%20generation.txt?view=markup (by the way, BEP is Brian Enhancement Proposal) The speed-up factor we got in our experimental code with GPU is very substantial when there are many neurons (= large vectors, e.g. 10 000 elements), even when operations are simple. Romain ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Erik Tollerud skrev: NumPy arrays on the GPU memory is an easy task. But then I would have to write the computation in OpenCL's dialect of C99? This is true to some extent, but also probably difficult to do given the fact that parallelizable algorithms are generally more difficult to formulate in straightforward ways.

Then you have misunderstood me completely. Creating an ndarray that has a buffer in graphics memory is not too difficult, given that graphics memory can be memory mapped. This has nothing to do with parallelizable algorithms or not. It is just memory management. We could make an ndarray subclass that quickly puts its content in a buffer accessible to the GPU. That is not difficult. But then comes the question of what you do with it.

I think many here misunderstand the issue: Teraflops peak performance of modern GPUs is impressive. But NumPy cannot easily benefit from that. In fact, there is little or nothing to gain from optimising in that end. In order for a GPU to help, computation must be the time-limiting factor. It is not. There is not more to say about using GPUs in NumPy right now.

Take a look at the timings here: http://www.scipy.org/PerformancePython It shows that computing with NumPy is more than ten times slower than using plain C. This is despite NumPy being written in C. The NumPy code does not incur 10 times more floating point operations than the C code. The floating point unit does not run in turtle mode when using NumPy. NumPy's relative slowness compared to C has nothing to do with floating point computation. It is due to inferior memory use (temporary buffers, multiple buffer traversals) and memory access being slow. Moving computation to the GPU can only make this worse.

Improved memory usage - e.g. through lazy evaluation and JIT compilation of expressions - can give up to a tenfold increase in performance. That is where we must start optimising to get a faster NumPy. Incidentally, this will also make it easier to leverage modern GPUs.

Sturla Molden ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
I realize this topic is a bit old, but I couldn't help but add something I forgot to mention earlier...

I mean, once the computations are moved elsewhere numpy is basically a convenient way to address memory. That is how I mostly use NumPy, though. Computations I often do in Fortran 95 or C.

NumPy arrays on the GPU memory is an easy task. But then I would have to write the computation in OpenCL's dialect of C99? But I'd rather program everything in Python if I could. Details like GPU and OpenCL should be hidden away. Nice looking Python with NumPy is much easier to read and write. That is why I'd like to see a code generator (i.e. JIT compiler) for NumPy.

This is true to some extent, but also probably difficult to do given the fact that parallelizable algorithms are generally more difficult to formulate in straightforward ways. In the intermediate term, I think there is value in having numpy implement some sort of interface to OpenCL or CUDA - I can easily see an explosion of different bindings (it's already starting), and having a canonical way encoded in numpy or scipy is probably the best way to mitigate the inevitable compatibility problems... I'm partial to the way pycuda can do it (basically, just export numpy arrays to the GPU and let you write the code from there), but the main point is to just get some basic compatibility in pretty quickly, as I think this GPGPU thing is here to stay...

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Sturla Molden wrote: Thus, here is my plan:

1. a special context-manager class
2. immutable arrays inside the with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit

There seems to be some similarity with what we want to do to accelerate our neural simulations (briansimulator.org), as described here: http://brian.svn.sourceforge.net/viewvc/brian/trunk/dev/BEPs/BEP-9-Automatic%20code%20generation.txt?view=markup (by the way, BEP is Brian Enhancement Proposal)

The speed-up factor we got in our experimental code with GPU is very substantial when there are many neurons (= large vectors, e.g. 10 000 elements), even when operations are simple.

Romain ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 11:12 AM, James Bergstra bergs...@iro.umontreal.ca wrote: David Warde-Farley dwf at cs.toronto.edu writes: It did inspire some of our colleagues in Montreal to create this, though: http://code.google.com/p/cuda-ndarray/ I gather it is VERY early in development, but I'm sure they'd love contributions! Hi David, That does look quite close to what I imagined, probably a good start then! Romain Hi, I'm one of the devs for that project. Thanks David for the link. I put some text on the homepage so it's a little more self-explanatory. We do welcome contributions. I feel like I must be reinventing the wheel on this, so I'd really appreciate it if someone who knows of a similar project would let me know about it. Otherwise we'll keep plugging away at replicating core ndarray interface elements (operators, math.h-type functions, array indexing, etc.) http://code.google.com/p/cuda-ndarray/

It almost looks like you are reimplementing numpy, in C++ no less. Is there any reason why you aren't working with a numpy branch and just adding ufuncs? I'm also curious if you have thoughts about how to use the GPU pipelines in parallel.

Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 1:19 PM, Charles R Harris charlesr.har...@gmail.com wrote: It almost looks like you are reimplementing numpy, in C++ no less. Is there any reason why you aren't working with a numpy branch and just adding ufuncs?

I don't know how that would work. The ufuncs need a datatype to work with, and AFAIK, it would break everything if a numpy ndarray pointed to memory on the GPU. Could you explain what you mean a little more?

I'm also curious if you have thoughts about how to use the GPU pipelines in parallel.

Current thinking for ufunc-type computations:
1) divide up the tensors into subtensors whose dimensions have power-of-two sizes (this permits a fast integer-to-ndarray-coordinate computation using bit shifting),
2) launch a kernel for each subtensor in its own stream to use parallel pipelines,
3) sync and return.

This is a pain to do without automatic code generation though. Currently we're using macros, but that's not pretty. C++ has templates, which we don't really use yet but were planning on using. These have some power to generate code. The 'theano' project (www.pylearn.org/theano), for which cuda-ndarray was created, has a more powerful code generation mechanism similar to weave. This algorithm is used in theano-cuda-ndarray. Scipy.weave could be very useful for generating code for specific shapes/ndims on demand, if weave could use nvcc.

James ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
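The power-of-two trick in step (1) is simply that converting a flat element index into ndarray coordinates becomes shifts and masks instead of integer division and modulo, which are comparatively expensive on current GPU hardware. A small Python illustration of the idea (the shapes and names are made up):

    # Flat index -> (i, j, k) for a subtensor of shape (2**3, 2**4, 2**5), C order.
    log2_shape = (3, 4, 5)

    def coords_from_flat(idx):
        coords = []
        for bits in reversed(log2_shape):            # last axis varies fastest
            coords.append(idx & ((1 << bits) - 1))   # mask instead of modulo
            idx >>= bits                             # shift instead of division
        return tuple(reversed(coords))

    assert coords_from_flat(0) == (0, 0, 0)
    # flat index of element (1, 2, 3) is ((1*2**4) + 2)*2**5 + 3
    assert coords_from_flat(((1 << 4) + 2) * (1 << 5) + 3) == (1, 2, 3)

Inside a CUDA kernel the same arithmetic would be done per thread to map the thread's global index onto the element of the subtensor it should process.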
Re: [Numpy-discussion] Fwd: GPU Numpy
Note that this is from a user perspective, as I have no particular plan of developing the details of this implementation, but I've thought for a long time that GPU support could be great for numpy (I would also vote for OpenCL support over CUDA, although conceptually they seem quite similar)... But what exactly would the large-scale plan be? One of the advantages of GPGPUs is that they are particularly suited to rather complicated parallelizable algorithms, and the numpy-level basic operations are just the simple arithmetic operations. So while I'd love to see it working, it's unclear to me exactly how much is gained at the core numpy level, especially given that it's limited to single precision on most GPUs.

Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll admit - especially if it's in the form of a drop-in replacement for the numpy or scipy versions.

By the way, I noticed no one mentioned the GPUArray class in pycuda (and it looks like there's something similar in pyopencl) - seems like that's already done a fair amount of the work... http://documen.tician.de/pycuda/array.html#pycuda.gpuarray.GPUArray

On Thu, Aug 6, 2009 at 10:41 AM, James Bergstra bergs...@iro.umontreal.ca wrote: On Thu, Aug 6, 2009 at 1:19 PM, Charles R Harris charlesr.har...@gmail.com wrote: It almost looks like you are reimplementing numpy, in C++ no less. Is there any reason why you aren't working with a numpy branch and just adding ufuncs? I don't know how that would work. The ufuncs need a datatype to work with, and AFAIK, it would break everything if a numpy ndarray pointed to memory on the GPU. Could you explain what you mean a little more? I'm also curious if you have thoughts about how to use the GPU pipelines in parallel. Current thinking for ufunc-type computations: 1) divide up the tensors into subtensors whose dimensions have power-of-two sizes (this permits a fast integer-to-ndarray-coordinate computation using bit shifting), 2) launch a kernel for each subtensor in its own stream to use parallel pipelines, 3) sync and return. This is a pain to do without automatic code generation though. Currently we're using macros, but that's not pretty. C++ has templates, which we don't really use yet but were planning on using. These have some power to generate code. The 'theano' project (www.pylearn.org/theano), for which cuda-ndarray was created, has a more powerful code generation mechanism similar to weave. This algorithm is used in theano-cuda-ndarray. Scipy.weave could be very useful for generating code for specific shapes/ndims on demand, if weave could use nvcc. James

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
2009/8/6 Erik Tollerud erik.tolle...@gmail.com: One of the advantages of GPGPUs is that they are particularly suited to rather complicated parallelizable algorithms, You mean simple parallelizable algorithms, I suppose? [snip] -- Information System Engineer, Ph.D. Website: http://matthieu-brucher.developpez.com/ Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92 LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On 6-Aug-09, at 2:54 PM, Erik Tollerud wrote: Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll admit - especially if it's in the form of a drop-in replacement for the numpy or scipy versions. The word I'm hearing from people I know who are using it is that if you have code that even just does lots of matrix multiplies, never mind solving systems or anything like that, the speedup is several orders of magnitude. Things that used to take weeks now take a day or two. If you can deal with the loss of precision it's really quite worth it. By the way, I noticed no one mentioned the GPUArray class in pycuda (and it looks like there's something similar in pyopencl) - seems like that's already done a fair amount of the work... http://documen.tician.de/pycuda/array.html#pycuda.gpuarray.GPUArray This seems like a great start, I agree. The lack of any documentation on 'dot' is worrying, though. David ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll admit - especially if it's in the form of a drop-in replacement for the numpy or scipy versions. NumPy generates temporary arrays for expressions involving ndarrays. This extra allocation and copying often takes more time than the computation. With GPGPUs, we have to bus the data to and from VRAM as well. D. Knuth quoted Hoare saying that premature optimization is the root of all evil. Optimizing computation when the bottleneck is memory is premature. In order to improve on this, I think we have to add lazy evaluation to NumPy. That is, an operator should not return a temporary array but a symbolic expression. So if we have an expression like y = a*x + b it should not evaluate a*x into a temporary array. Rather, the operators would build up a parse tree like y = add(multiply(a,x),b) and evaluate the whole expression later on. This would require two things: First, we need dynamic code generation, which incidentally is what OpenCL is all about. I.e., OpenCL is a dynamically invoked compiler; there is a function clCreateProgramWithSource, which does just what it says. Second, we need arrays to be immutable. This is very important. If arrays are not immutable, code like this could fail:

y = a*x + b
x[0] = 1235512371235

With lazy evaluation, the memory overhead would be much smaller. The GPGPU would also get more complex expressions to use as kernels. There should be an option of running this on the CPU, possibly using OpenMP for multi-threading. We could either depend on a compiler (C or Fortran) being installed, or use opcodes for a dedicated virtual machine (cf. what numexpr does). To limit the impact of making arrays immutable, we could introduce a context manager. Inside the with statement, all arrays would be immutable, and the __exit__ method could trigger the code generator and do all the evaluation. So we would get something like this:

# normal numpy here
with numpy.accelerator():
    # arrays become immutable
    # lazy evaluation
    # code generation and evaluation on exit
# normal numpy continues here

Thus, here is my plan:
1. a special context-manager class
2. immutable arrays inside the with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit

I guess it is possible to find ways to speed this up as well. If a context manager always generates the same OpenCL code, the with statement would only need to execute once (we could raise an exception on enter to jump directly to exit). It is possible to create a superfast NumPy. But just plugging GPGPUs into the current design would be premature. In NumPy's current state, with mutable ndarrays and operators generating temporary arrays, there is not much to gain from introducing GPGPUs. It would only be beneficial in computationally demanding parts like FFTs and solvers for linear algebra and differential equations. Ufuncs with transcendental functions might also benefit. SciPy would certainly benefit more from GPGPUs than NumPy. Just my five cents :-) Regards, Sturla Molden ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
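[Editor's note: as a toy illustration of the parse-tree idea (my sketch, not Sturla's code), arithmetic on "lazy" arrays can simply record the operation graph and evaluate it in one pass on request; a real implementation would hand the recorded graph to a code generator instead of calling NumPy node by node as this does.]

import numpy as np

class Lazy:
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __mul__(self, other):
        return Lazy(np.multiply, self, other)
    def __add__(self, other):
        return Lazy(np.add, self, other)
    def evaluate(self):
        vals = [a.evaluate() if isinstance(a, Lazy) else a for a in self.args]
        return self.op(*vals)

class LazyArray(Lazy):
    def __init__(self, data):
        self.data = np.asarray(data)
    def evaluate(self):
        return self.data

a, x, b = LazyArray([1.0, 2.0]), LazyArray([3.0, 4.0]), LazyArray([5.0, 6.0])
y = a * x + b            # builds add(multiply(a, x), b); nothing is computed yet
print(y.evaluate())      # [ 8. 14.]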
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 15:57, Sturla Molden stu...@molden.no wrote: Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll admit - especially if it's in the form of a drop-in replacement for the numpy or scipy versions. NumPy generates temporary arrays for expressions involving ndarrays. This extra allocation and copying often takes more time than the computation. With GPGPUs, we have to bus the data to and from VRAM as well. D. Knuth quoted Hoare saying that premature optimization is the root of all evil. Optimizing computation when the bottleneck is memory is premature. It is possible to create a superfast NumPy. But just plugging GPGPUs into the current design would be premature. In NumPy's current state, with mutable ndarrays and operators generating temporary arrays, there is not much to gain from introducing GPGPUs. It would only be beneficial in computationally demanding parts like FFTs and solvers for linear algebra and differential equations. I believe that is exactly the point that Erik is making. :-) -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Robert Kern wrote: I believe that is exactly the point that Erik is making. :-) I wasn't arguing against him, just suggesting a solution. :-) I have big hopes for lazy evaluation, if we can find a way to do it right. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 4:57 PM, Sturla Molden stu...@molden.no wrote: Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll admit - especially if it's in the form of a drop-in replacement for the numpy or scipy versions. NumPy generates temporary arrays for expressions involving ndarrays. This extra allocation and copying often takes more time than the computation. With GPGPUs, we have to bus the data to and from VRAM as well. D. Knuth quoted Hoare saying that premature optimization is the root of all evil. Optimizing computation when the bottleneck is memory is premature. In order to improve on this, I think we have to add lazy evaluation to NumPy. That is, an operator should not return a temporary array but a symbolic expression. So if we have an expression like y = a*x + b it should not evaluate a*x into a temporary array. Rather, the operators would build up a parse tree like y = add(multiply(a,x),b) and evaluate the whole expression later on. [snip] Regards, Sturla Molden Hi Sturla, The plan you describe is a good one, and Theano (www.pylearn.org/theano) almost exactly implements it. You should check it out. It does not use 'with' syntax at the moment, but it could provide the backend machinery for your mechanism if you want to go forward with that. Theano provides
- symbolic expression building for a big subset of what numpy can do (and a few things that it doesn't)
- expression optimization (for faster and more accurate computations)
- dynamic code generation
- caching of compiled functions to disk
Also, when you have a symbolic expression graph you can do cute stuff like automatic differentiation. We're currently working on the bridge between theano and cuda so that you can declare certain inputs as residing on the GPU instead of in host memory, and don't have to transfer things to and from host memory as much. James ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
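[Editor's note: a small, hedged example of the workflow James describes, written against the later public Theano interface (the 2009 API differed in detail). Expressions are built symbolically, and graph optimization plus code generation happen when theano.function is called.]

import theano
import theano.tensor as T

a, x, b = T.dvectors('a', 'x', 'b')
y = a * x + b                       # symbolic graph: add(mul(a, x), b)
f = theano.function([a, x, b], y)   # optimization and code generation happen here
print(f([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))   # [ 8. 14.]

# Symbolic graphs also give automatic differentiation essentially for free:
s = T.dscalar('s')
dsq = theano.function([s], T.grad(s ** 2, s))
print(dsq(3.0))                     # 6.0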
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 3:29 PM, James Bergstra bergs...@iro.umontreal.ca wrote: On Thu, Aug 6, 2009 at 4:57 PM, Sturla Molden stu...@molden.no wrote: [snip] Hi Sturla, The plan you describe is a good one, and Theano (www.pylearn.org/theano) almost exactly implements it. You should check it out. [snip] So what simple things could numpy implement that would help here? It almost sounds like numpy would mostly be an interface on the Python side, and the GPU would execute specialized code written and compiled for specific problems. Whether the code that gets compiled is written using lazy evaluation (a la Sturla), or is expressed some other way, seems like an independent issue. It sounds like one important thing would be having arrays that reside on the GPU. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Charles R Harris wrote: Whether the code that gets compiled is written using lazy evaluation (a la Sturla), or is expressed some other way, seems like an independent issue. It sounds like one important thing would be having arrays that reside on the GPU. Memory management is slow compared to computation. Operations like malloc, free, and memcpy are not faster for VRAM than for RAM. There will be no benefit from the GPU if the bottleneck is memory. That is why we need to get rid of the creation of temporary arrays, hence lazy evaluation. Having arrays reside in VRAM would reduce the communication between RAM and VRAM, but the problem with temporary arrays is still there. Also, VRAM tends to be a limited resource. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
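[Editor's note: a small illustration (mine, not Sturla's) of the temporaries in question: each binary ufunc allocates a fresh output unless told otherwise, so a chained expression pays for extra allocations and memory traffic. Explicit output buffers avoid them, which is exactly the bookkeeping lazy evaluation would hide from the user.]

import numpy as np

n = 1_000_000
a, x, b = (np.random.rand(n) for _ in range(3))

y = a * x + b                 # allocates a temporary for a*x, then another for the sum

y2 = np.empty_like(a)         # same computation with no temporaries
np.multiply(a, x, out=y2)
np.add(y2, b, out=y2)
assert np.allclose(y, y2)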
Re: [Numpy-discussion] Fwd: GPU Numpy
Sturla Molden wrote: Memory management is slow compared to computation. Operations like malloc, free, and memcpy are not faster for VRAM than for RAM. Actually it's not VRAM anymore, but whatever you call the memory dedicated to the GPU. It is cheap to put 8 GB of RAM into a computer, but graphics cards with more than 1 GB of memory are expensive and uncommon on e.g. laptops. And this memory will be needed for other things as well, e.g. graphics. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 4:36 PM, Sturla Molden stu...@molden.no wrote: Charles R Harris wrote: Whether the code that gets compiled is written using lazy evaluation (a la Sturla), or is expressed some other way, seems like an independent issue. It sounds like one important thing would be having arrays that reside on the GPU. Memory management is slow compared to computation. Operations like malloc, free, and memcpy are not faster for VRAM than for RAM. There will be no benefit from the GPU if the bottleneck is memory. That is why we need to get rid of the creation of temporary arrays, hence lazy evaluation. Having arrays reside in VRAM would reduce the communication between RAM and VRAM, but the problem with temporary arrays is still there. I'm not arguing with that, but I regard it as a separate problem. One could, after all, simply use an expression-to-GPU compiler to generate modules. The question is what simple additions we can make to numpy so that it acts as a convenient I/O channel. I mean, once the computations are moved elsewhere, numpy is basically a convenient way to address memory. Also, VRAM tends to be a limited resource. But getting less so. These days it comes in gigabytes and there is no reason why it shouldn't soon exceed what many folks have for main memory. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
Charles R Harris wrote: I mean, once the computations are moved elsewhere, numpy is basically a convenient way to address memory. That is how I mostly use NumPy, though. Computations I often do in Fortran 95 or C. Putting NumPy arrays in GPU memory is an easy task. But then I would have to write the computation in OpenCL's dialect of C99? I'd rather program everything in Python if I could. Details like GPU and OpenCL should be hidden away. Nice-looking Python with NumPy is much easier to read and write. That is why I'd like to see a code generator (i.e. JIT compiler) for NumPy. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
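[Editor's note: for a sense of the "OpenCL dialect of C99" Sturla would rather not write by hand, here is a hedged sketch driven from Python with pyopencl (kernel and variable names are mine, not from the thread); this is the kind of code a NumPy-level JIT would generate behind the scenes.]

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

src = """
__kernel void axpb(__global const float *a, __global const float *x,
                   __global const float *b, __global float *y)
{
    int i = get_global_id(0);
    y[i] = a[i] * x[i] + b[i];   /* the whole expression fused into one kernel */
}
"""
prg = cl.Program(ctx, src).build()      # runtime compilation from source

n = 1 << 20
a, x, b = (np.random.rand(n).astype(np.float32) for _ in range(3))
mf = cl.mem_flags
bufs = [cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h) for h in (a, x, b)]
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg.axpb(queue, (n,), None, bufs[0], bufs[1], bufs[2], y_buf)
y = np.empty_like(a)
cl.enqueue_copy(queue, y, y_buf)
assert np.allclose(y, a * x + b)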
Re: [Numpy-discussion] Fwd: GPU Numpy
James Bergstra wrote: The plan you describe is a good one, and Theano (www.pylearn.org/theano) almost exactly implements it. You should check it out. It does not use 'with' syntax at the moment, but it could provide the backend machinery for your mechanism if you want to go forward with that. Theano provides - symbolic expression building for a big subset of what numpy can do (and a few things that it doesn't) - expression optimization (for faster and more accurate computations) - dynamic code generation - cacheing of compiled functions to disk. Thank you James, theano looks great. :-D Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 5:10 PM, Sturla Molden stu...@molden.no wrote: Charles R Harris wrote: I mean, once the computations are moved elsewhere, numpy is basically a convenient way to address memory. That is how I mostly use NumPy, though. Computations I often do in Fortran 95 or C. Putting NumPy arrays in GPU memory is an easy task. Glad to hear it. So maybe some way to specify and track where the memory is allocated would be helpful. Travis wants to add a dictionary to ndarrays and that might be useful here. But then I would have to write the computation in OpenCL's dialect of C99? I'd rather program everything in Python if I could. Details like GPU and OpenCL should be hidden away. Nice-looking Python with NumPy is much easier to read and write. That is why I'd like to see a code generator (i.e. JIT compiler) for NumPy. Yes, but that is a language/compiler problem. I'm thinking of what tools numpy can offer that would help people experimenting with different approaches to using GPUs. At some point we might want to adopt a working approach, but now seems a bit early for that. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 1:57 PM, Sturla Molden stu...@molden.no wrote: To limit the impact of making arrays immutable, we could introduce a context manager. Inside the with statement, all arrays would be immutable, and the __exit__ method could trigger the code generator and do all the evaluation. So we would get something like this:

# normal numpy here
with numpy.accelerator():
    # arrays become immutable
    # lazy evaluation
    # code generation and evaluation on exit
# normal numpy continues here

Thus, here is my plan:
1. a special context-manager class
2. immutable arrays inside the with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit

You will face one issue here: unless you raise a special exception inside the with block, the python interpreter will unconditionally execute that code without your control. I had a long talk about this with Alex Martelli last year at scipy, where I pitched the idea of allowing context managers to have an optional third method, __execute__, which would get the code block in the with statement for execution. He was fairly pessimistic about the possibility of this making its way into python, mostly (if I recall correctly) because of scoping issues: the with statement does not introduce a new scope, so you'd need to pass to this method the code plus the locals/globals of the entire enclosing scope, which felt messy. There was also the thorny question of how to pass the code block. Source? Bytecode? What? In many environments the source may not be available. Last year I wrote a gross hack to do this, which you can find here: http://bazaar.launchpad.net/~ipython-dev/ipython/0.10/annotate/head%3A/IPython/kernel/contexts.py The idea is that it would be used by code like this (note, this doesn't actually work right now):

def test_simple():
    # XXX - for now, we need a running cluster to be started separately. The
    # daemon work is almost finished, and will make much of this unnecessary.
    from IPython.kernel import client
    mec = client.MultiEngineClient(('127.0.0.1',10105))
    try:
        mec.get_ids()
    except ConnectionRefusedError:
        import os, time
        os.system('ipcluster -n 2 ')
        time.sleep(2)
        mec = client.MultiEngineClient(('127.0.0.1',10105))
    mec.block = False
    parallel = RemoteMultiEngine(mec)
    mec.pushAll()
    with parallel as pr:
        # A comment
        remote()  # this means the code below only runs remotely
        print 'Hello remote world'
        x = range(10)
        # Comments are OK
        # Even misindented.
        y = x+1
    print pr.x + pr.y

The problem with my approach is that I find it brittle and ugly enough that I ultimately abandoned it. I'd love to see if you find a proper solution for this... Cheers, f ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fwd: GPU Numpy
On Thu, Aug 6, 2009 at 19:00, Fernando Perez fperez@gmail.com wrote: [snip] I had a long talk about this with Alex Martelli last year at scipy, where I pitched the idea of allowing context managers to have an optional third method, __execute__, which would get the code block in the with statement for execution. He was fairly pessimistic about the possibility of this making its way into python, mostly (if I recall correctly) because of scoping issues: the with statement does not introduce a new scope, so you'd need to pass to this method the code plus the locals/globals of the entire enclosing scope, which felt messy. Sometimes, I fantasize about writing a python4ply grammar that repurposes the `` quotes to provide expression literals and ``` ``` triple quotes for multiline statement literals. They would be literals for _ast abstract syntax trees. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion