Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
 You do realize that the throughput from onboard (video) RAM is going
 to be much higher, right? It's not just the parallelization but the
 memory bandwidth. And as James pointed out, if you can keep most of
 your intermediate computation on-card, you stand to benefit immensely,
 even if doing some operations where the GPU provides no tangible
 benefit (i.e. the benefit is in aggregate and avoiding copies).

Good point made here. GPUs support bandwidth of O(100 GBps) (bytes, not
bits). Upcoming GPUs will likely break the 250 GBps mark. Even if
your expressions involve low operation/memory ratios, GPUs are a big
win, as their memory bandwidth is higher than that of the CPU's L2 and
even L1 caches.
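
For a rough feel of what that means for a memory-bound expression, here is a
back-of-envelope sketch in plain Python (bandwidth figures taken from this
thread; the numbers are only illustrative):

n = 10**8                      # 10^8 float64 elements
bytes_moved = 4 * n * 8        # read a, b, c and write r, for r = a*b + c
for name, gbps in (("CPU ~25 GB/s", 25e9), ("GPU ~141 GB/s", 141e9)):
    print("%s: %.3f s just to stream the data" % (name, bytes_moved / gbps))

Even with a trivial operation/memory ratio, the streaming time alone differs
by the bandwidth ratio.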

Regards,

-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 09:45:29 Rohit Garg escrigué:
  You do realize that the throughput from onboard (video) RAM is going
  to be much higher, right? It's not just the parallelization but the
  memory bandwidth. And as James pointed out, if you can keep most of
  your intermediate computation on-card, you stand to benefit immensely,
  even if doing some operations where the GPU provides no tangible
  benefit (i.e. the benefit is in aggregate and avoiding copies).

 Good point made here. GPUs support bandwidth of O(100 GBps) (bytes, not
 bits). Upcoming GPUs will likely break the 250 GBps mark. Even if
 your expressions involve low operation/memory ratios, GPUs are a big
 win, as their memory bandwidth is higher than that of the CPU's L2 and
 even L1 caches.

Where are you getting this info from?  IMO the technology of memory in 
graphics boards cannot be so different from that in commercial motherboards.  It 
could be a *bit* faster (at the expense of packing less of it), but I'd say 
not as much as 4x faster (100 GB/s vs 25 GB/s of Intel i7 in sequential 
access), as you are suggesting.  Maybe this is GPU cache bandwidth?

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
 Where are you getting this info from? IMO the technology of memory in
 graphics boards cannot be so different than in commercial motherboards. It
 could be a *bit* faster (at the expenses of packing less of it), but I'd say
 not as much as 4x faster (100 GB/s vs 25 GB/s of Intel i7 in sequential
 access), as you are suggesting. Maybe this is GPU cache bandwidth?

This is publicly documented. You can start off by looking at the
wikipedia stuff.

For reference,

gtx280--141GBps--has 1GB
ati4870--115GBps--has 1GB
ati5870--153GBps (launches sept 22, 2009)--2GB models will be there too

Next gen nv gpu's will *assuredly* have bandwidth in excess of 200 GBps.

This is *off-chip memory bandwidth* from graphics memory (aka video
ram). GPUs have (very small) caches, but they don't reduce memory
latency.


 --

 Francesc Alted

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion





-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Citi, Luca
Hi Sturla,

 The proper way to speed up dot(a*b+c*sqrt(d), e) is to get rid of 
 temporary intermediates.
I implemented a patch 
http://projects.scipy.org/numpy/ticket/1153
that reduces the number of temporary intermediates.
In your example from 4 to 2.
There is a big improvement in terms of memory footprint,
and some improvement in terms of speed (especially for
large matrices) but not as much as I expected.

In your example
 result = 0
 for i in range(n):
     result += (a[i]*b[i] + c[i]*sqrt(d[i])) * e[i]
another big speedup could come from the fact that it
makes better use of the cache.

That is exactly why numexpr is faster in these cases.
I hope one day numpy will be able to perform such
optimizations.
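
For illustration, here is a hand-rolled version of the same kind of
reduction (plain NumPy, not the patch itself): using in-place operations,
dot(a*b+c*sqrt(d), e) can be evaluated with two full-size temporaries
instead of four:

import numpy as np

def dot_with_two_temps(a, b, c, d, e):
    tmp1 = np.sqrt(d)        # temporary 1
    tmp1 *= c                # tmp1 = c*sqrt(d), updated in place
    tmp2 = a * b             # temporary 2
    tmp2 += tmp1             # tmp2 = a*b + c*sqrt(d), updated in place
    return np.dot(tmp2, e)   # vector dot product, no further temporaries

As I understand it, the patch aims to do this kind of buffer reuse
automatically, without rewriting the expression.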

Best,
Luca
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Sturla Molden
Citi, Luca skrev:
 That is exactly why numexpr is faster in these cases.
 I hope one day numpy will be able to perform such
 optimizations.
   
I think it is going to require lazy evaluation. Whenever possible, an 
operator would just return a symbolic representation of the operation. 
This would gradually build up a tree of operators and buffers. When 
someone tries to read the data from an array, the buffer is created 
on-demand by flushing procrastinated expressions. One must be sure that 
the buffers referenced in an incomplete expression never change. This 
would be easiest to ensure with immutable buffers.  Numexpr is the kind 
of back-end a system like this would require.  But a lot of the code in 
numexpr can be omitted because Python creates the parse tree; we would 
not need the expression parser in numexpr as a frontend. Well... this plan 
is gradually getting closer to a specialized SciPy JIT-compiler. It would 
be fun to make if I could find time for it.
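
For concreteness, a minimal sketch of the idea (hypothetical classes, not an
existing NumPy API): operators build an expression tree, and the buffer is
only materialized when someone reads the data.

import numpy as np

class Lazy(object):
    """A deferred expression node: op applied to args, evaluated on demand."""
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Lazy(np.add, self, other)
    def __mul__(self, other):
        return Lazy(np.multiply, self, other)
    def evaluate(self):
        args = [a.evaluate() if isinstance(a, Lazy) else a for a in self.args]
        return args[0] if self.op is None else self.op(*args)

def lazy(arr):
    return Lazy(None, arr)   # wrap a plain ndarray as a leaf node

a, b, c = (np.arange(5.0) for _ in range(3))
expr = lazy(a) * lazy(b) + lazy(c)   # nothing computed yet, only a tree is built
print(expr.evaluate())               # buffers created on demand

A real back-end would hand the whole tree to something like numexpr (or a JIT)
instead of walking it with plain NumPy calls, and would have to guarantee that
a, b and c cannot change between tree construction and evaluation -- which is
where the immutable buffers come in.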

Sturla Molden


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Sturla Molden
Rohit Garg skrev:
 gtx280--141GBps--has 1GB
 ati4870--115GBps--has 1GB
 ati5870--153GBps (launches sept 22, 2009)--2GB models will be there too
   
That is going to help if buffers are kept in graphics memory. But the 
problem is that graphics memory is a scarce resource.

S.M.






___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 11:11:22 Sturla Molden escrigué:
 Citi, Luca skrev:
  That is exactly why numexpr is faster in these cases.
  I hope one day numpy will be able to perform such
  optimizations.

 I think it is going to require lazy evaluation. Whenever possible, an
 operator would just return a symbolic representation of the operation.
 This would gradually build up a tree of operators and buffers. When
 someone tries to read the data from an array, the buffer is created
 on-demand by flushing procrastinated expressions. One must be sure that
 the buffers referenced in an incomplete expression never change. This
 would be easiest to ensure with immutable buffers.  Numexpr is the kind
 of back-end a system like this would require.  But a lot of the code in
 numexpr can be omitted because Python creates the parse tree; we would
 not need the expression parser in numexpr as a frontend. Well... this plan
 is gradually getting closer to a specialized SciPy JIT-compiler. It would
 be fun to make if I could find time for it.

Numexpr already uses the Python parser, instead of building a new one.  However, 
the bytecode emitted after the compilation process is different, of course.

Also, I don't see the point in requiring immutable buffers.  Could you develop 
this further?

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Gael Varoquaux
On Thu, Sep 10, 2009 at 10:36:27AM +0200, Francesc Alted wrote:
Where are you getting this info from? IMO the technology of memory in
graphics boards cannot be so different than in commercial motherboards. It
could be a *bit* faster (at the expenses of packing less of it), but I'd
say not as much as 4x faster (100 GB/s vs 25 GB/s of Intel i7 in
sequential access), as you are suggesting. Maybe this is GPU cache
bandwidth?

I believe this is simply because the transfers are made in parallel to the
different processing units of the graphics card. So we are back to the
importance of embarrassingly parallel problems and of specifying things with
high-level operations rather than for loops.

Gaël
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 10:58:13 Rohit Garg escrigué:
  Where are you getting this info from? IMO the technology of memory in
  graphics boards cannot be so different than in commercial motherboards.
  It could be a *bit* faster (at the expenses of packing less of it), but
  I'd say not as much as 4x faster (100 GB/s vs 25 GB/s of Intel i7 in
  sequential access), as you are suggesting. Maybe this is GPU cache
  bandwidth?

 This is publicly documented. You can start off by looking at the
 wikipedia stuff.

 For reference,

 gtx280--141GBps--has 1GB
 ati4870--115GBps--has 1GB
 ati5870--153GBps (launches sept 22, 2009)--2GB models will be there too

 Next gen nv gpu's will *assuredly* have bandwidth in excess of 200 GBps.

 This is *off chip memory bandwidth* from graphics memory (aka video
 ram). GPU have (very small) caches but they don't reduce memory
 latency.

That's nice to see.  I think I'll change my mind if someone could perform a 
vector-vector multiplication (an operation that is typically memory-bound) in 
double precision up to 5x faster on a gtx280 nv card than on an Intel 
i7 CPU.

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 11:20:21 Gael Varoquaux escrigué:
 On Thu, Sep 10, 2009 at 10:36:27AM +0200, Francesc Alted wrote:
 Where are you getting this info from? IMO the technology of memory in
 graphics boards cannot be so different than in commercial
  motherboards. It could be a *bit* faster (at the expenses of packing less
  of it), but I'd say not as much as 4x faster (100 GB/s vs 25 GB/s of
  Intel i7 in sequential access), as you are suggesting. Maybe this is GPU
  cache bandwidth?

 I believe this is simply because the transfers is made in parallel to the
 different processing units of the graphic card. So we are back to
 importance of embarrassingly parallel problems and specifying things with
 high-level operations rather than for loop.

Sure.  Especially because NumPy is all about embarrassingly parallel problems 
(after all, this is how a ufunc works, doing operations element-by-element).
The point is: are GPUs prepared to compete with general-purpose CPUs in all-
round operations, like evaluating transcendental functions, conditionals, all of 
this with a rich set of data types?  I would like to believe that this is the 
case, but I don't think so (at least not yet).

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Gael Varoquaux
On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote:
The point is: are GPUs prepared to compete with general-purpose CPUs in
all-round operations, like evaluating transcendental functions,
conditionals, all of this with a rich set of data types? I would like to
believe that this is the case, but I don't think so (at least not yet).

I believe (this is very foggy) that GPUs can implement non-trivial logic
in their base processing units, so that conditionals and transcendental
functions are indeed possible. Where it gets hard is when you don't have
problems that can be expressed in an embarrassingly parallel manner.
There are solutions for that too (I believe of the message-passing type);
after all, matrix multiplication is done on GPUs.

Gaël
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Matthieu Brucher
 Sure. Specially because NumPy is all about embarrasingly parallel problems
 (after all, this is how an ufunc works, doing operations
 element-by-element).

 The point is: are GPUs prepared to compete with a general-purpose CPUs in
 all-road operations, like evaluating transcendental functions, conditionals
 all of this with a rich set of data types? I would like to believe that this
 is the case, but I don't think so (at least not yet).

A lot of nVidia's SDK functions are not done on the GPU. There are some
functions that they provide where the actual computation is done on
the CPU, not on the GPU (I don't have an example here, but nVidia's
forum is full of examples ;))

Matthieu
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
 The point is: are GPUs prepared to compete with a general-purpose CPUs in
 all-road operations, like evaluating transcendental functions, conditionals
 all of this with a rich set of data types?
Yup.

-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 11:40:48 Sturla Molden escrigué:
 Francesc Alted skrev:
  Numexpr already uses the Python parser, instead of build a new one.
  However the bytecode emitted after the compilation process is
  different, of course.
 
  Also, I don't see the point in requiring immutable buffers. Could you
  develop this further?

 If you do lazy evaluation, a function like this could fail without
 immutable buffers:

 def foobar(x):
     y = a*x[:] + b
     x[0] = 0  # affects y and anything else depending on x
     return y

 Immutable buffers are not required; one could document the oddity, but
 coding would be very error-prone.


Mmh, I don't see a problem here if the order of operations is kept untouched (and 
you normally want to do this).  But I'm not an expert on 'lazy evaluation', so you 
may want to just ignore my comments ;-)

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 11:37:24 Gael Varoquaux escrigué:
 On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote:
 The point is: are GPUs prepared to compete with a general-purpose CPUs
  in all-road operations, like evaluating transcendental functions,
  conditionals all of this with a rich set of data types? I would like to
  believe that this is the case, but I don't think so (at least not yet).

 I believe (this is very foggy) that GPUs can implement non trivial logic
 on there base processing unit, so that conditionals and transcendental
 functions are indeed possible. Where it gets hard is when you don't have
 problems that can be expressed in an embarrassingly parallel manner.

But NumPy is about embarrassingly parallel calculations, right?  I mean:

a = np.cos(b)

where b is a 1x1 matrix is *very* embarrassing (in the parallel 
meaning of the term ;-)

Anyone here can say how the above operation can be done with GPUs?  (and 
providing some timings would be really great :)

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
 a = np.cos(b)

 where b is a 1x1 matrix is *very* embarrassing (in the parallel
 meaning of the term ;-)

On this operation, GPUs will eat up CPUs like a pack of piranhas. :)

-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
 That's nice to see. I think I'll change my mind if someone could perform a
 vector-vector multiplication (a operation that is typically memory-bounded)

You mean a dot product?

-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 14:36:16 Rohit Garg escrigué:
  That's nice to see. I think I'll change my mind if someone could perform
  a vector-vector multiplication (a operation that is typically
  memory-bounded)

 You mean a dot product?

Whatever, dot product or element-wise product.  Both are memory-bound.

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Bruce Southey

On 09/10/2009 07:40 AM, Francesc Alted wrote:
 A Thursday 10 September 2009 14:36:16 Rohit Garg escrigué:
   That's nice to see. I think I'll change my mind if someone could perform
   a vector-vector multiplication (a operation that is typically
   memory-bounded)
  You mean a dot product?
 Whatever, dot product or element-wise product. Both are memory-bounded.
 --
 Francesc Alted


As Francesc previously said, these need to be at least in double precision 
and really should also be in all the floating point precisions used 
by numpy on supported platforms. Based on the various boinc project 
comments, many graphics cards do not natively support double precision, 
so you can get an inflated speedup just because of the difference in 
precision.


Bruce
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
Apart from float and double, which floating point formats are
supported by numpy?

On Thu, Sep 10, 2009 at 7:09 PM, Bruce Southey bsout...@gmail.com wrote:
 On 09/10/2009 07:40 AM, Francesc Alted wrote:
 A Thursday 10 September 2009 14:36:16 Rohit Garg escrigué:
   That's nice to see. I think I'll change my mind if someone could perform
   a vector-vector multiplication (a operation that is typically
   memory-bounded)
  You mean a dot product?
 Whatever, dot product or element-wise product. Both are memory-bounded.
 --
 Francesc Alted

 As Francesc previous said, these need to be at least in double precision and
 really it should also be in all the floating point precisions used by numpy
 on supported platforms. Based on the various boinc project comments, many
 graphics cards do not natively support double precision so  you can get an
 inflated speedup just because of the difference in precision.

 Bruce

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion





-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Francesc Alted
A Thursday 10 September 2009 15:51:15 Rohit Garg escrigué:
 Apart from float and double, which floating point formats are
 supported by numpy?

I think whatever is supported by the underlying CPU, whether it is extended 
double precision (12 bytes) or quad precision (16 bytes).
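
A quick way to check what a given platform actually offers (what np.longdouble
maps to is platform-dependent: typically the 80-bit x87 format padded to 12 or
16 bytes on x86 Linux, but just plain double with MSVC):

import numpy as np

for dt in (np.float32, np.float64, np.longdouble):
    info = np.finfo(dt)
    print("%s: %d storage bits, eps = %g" % (dt.__name__, info.bits, info.eps))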

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
 I think whatever supported by the underlying CPU, whenever it is extended
 double precision (12 bytes) or quad precision (16 bytes).

Classic 64-bit CPUs support neither.

 --

 Francesc Alted

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion





-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Robert Kern
On Thu, Sep 10, 2009 at 07:28, Francesc Alted fal...@pytables.org wrote:
 A Thursday 10 September 2009 11:37:24 Gael Varoquaux escrigué:
 On Thu, Sep 10, 2009 at 11:29:49AM +0200, Francesc Alted wrote:
  The point is: are GPUs prepared to compete with general-purpose CPUs
  in all-round operations, like evaluating transcendental functions,
  conditionals, all of this with a rich set of data types? I would like to
  believe that this is the case, but I don't think so (at least not yet).
 I believe (this is very foggy) that GPUs can implement non-trivial logic
 in their base processing units, so that conditionals and transcendental
 functions are indeed possible. Where it gets hard is when you don't have
 problems that can be expressed in an embarrassingly parallel manner.
 But NumPy is about embarrassingly parallel calculations, right? I mean:

 a = np.cos(b)

 where b is a 1x1 matrix is *very* embarrassing (in the parallel
 meaning of the term ;-)

Yes. However, it is worth making the distinction between
embarrassingly parallel problems and SIMD problems. Not all
embarrassingly parallel problems are SIMD-capable. GPUs do SIMD, not
generally embarrassing problems. If there are branches, as would be
necessary for many special functions, the GPU does not perform as
well. Basically, every unit has to do both branches because they all
must do the same instruction at the same time, even though the data on
each unit only gets processed by one branch.

cos() is easy. Or at least it is so necessary to graphics computing that
it is already a primitive in all (most?) GPU languages. Googling
around shows SIMD code for the basic transcendental functions. I
believe you have to code them differently than you would on a CPU.
Other special functions would simply be hard to do efficiently.
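
To make the "compute both branches, then select" point concrete, here is a
plain NumPy (CPU) sketch of the pattern; it is not GPU code, but it is
essentially what a SIMD unit ends up doing when a warp diverges:

import numpy as np

x = np.linspace(-5.0, 5.0, 9)

# Element-wise equivalent of: y = sin(x)/x if x != 0 else 1.0
with np.errstate(divide="ignore", invalid="ignore"):
    branch_a = np.sin(x) / x        # evaluated everywhere, even at x == 0
branch_b = np.ones_like(x)          # the "else" branch, also evaluated everywhere
y = np.where(x != 0, branch_a, branch_b)   # per-element select
print(y)

Every element pays the cost of both branches; the mask only chooses which
result survives.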

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-10 Thread Rohit Garg
 Yes. However, it is worth making the distinction between
 embarrassingly parallel problems and SIMD problems. Not all
 embarrassingly parallel problems are SIMD-capable. GPUs do SIMD, not
 generally embarrassing problems.

GPUs exploit both dimensions of parallelism, both SIMD (aka
vectorization) and parallelization (aka multicore). And yeah, 99.9% of
the time branching on the GPU should be the least/last of your worries if
your problem is data-parallel. There are much worse things than
branching.

As for SIMD special functions, branching can certainly be eliminated.
I have written/come across some special functions myself, and I do not
know any case which is difficult to do efficiently on a GPU.
Certainly, I know less than some folks around here. Maybe you can
contribute a counterexample to this discussion.

Regards,

-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Francesc Alted
A Tuesday 08 September 2009 21:19:05 George Dahl escrigué:
 Sturla Molden sturla at molden.no writes:
  Erik Tollerud skrev:
   NumPy arrays on the GPU memory is an easy task. But then I would have
   to write the computation in OpenCL's dialect of C99?
  
   This is true to some extent, but also probably difficult to do given
   the fact that paralellizable algorithms are generally more difficult
   to formulate in striaghtforward ways.
 
  Then you have misunderstood me completely. Creating an ndarray that has
  a buffer in graphics memory is not too difficult, given that graphics
  memory can be memory mapped. This has nothing to do with parallelizable
  algorithms or not. It is just memory management. We could make an
  ndarray subclass that quickly puts is content in a buffer accessible to
  the GPU. That is not difficult. But then comes the question of what you
  do with it.
 
  I think many here misunderstands the issue here:
 
  Teraflops peak performance of modern GPUs is impressive. But NumPy
  cannot easily benefit from that. In fact, there is little or nothing to
  gain from optimising in that end. In order for a GPU to help,
  computation must be the time-limiting factor. It is not. There is not
  more to say about using GPUs in NumPy right now.
 
  Take a look at the timings here: http://www.scipy.org/PerformancePython
  It shows that computing with NumPy is more than ten times slower than
  using plain C. This is despite NumPy being written in C. The NumPy code
  does not incur 10 times more floating point operations than the C code.
  The floating point unit does not run in turtle mode when using NumPy.
  NumPy's relative slowness compared to C has nothing to do with floating
  point computation. It is due to inferior memory use (temporary buffers,
  multiple buffer traversals) and memory access being slow. Moving
  computation to the GPU can only make this worse.
 
  Improved memory usage - e.g. through lazy evaluation and JIT compilaton
  of expressions - can give up to a tenfold increase in performance. That
  is where we must start optimising to get a faster NumPy. Incidentally,
  this will  also make it easier to leverage on modern GPUs.
 
  Sturla Molden

 I know that for my work, I can get around an order of a 50-fold speedup
 over numpy using a python wrapper for a simple GPU matrix class.  So I
 might be dealing with a lot of matrix products where I multiply a fixed 512
 by 784 matrix by a 784 by 256 matrix that changes between each matrix
 product, although to really see the largest gains I use a 4096 by 2048
 matrix times a bunch of 2048 by 256 matrices.  If all I was doing were
 those matrix products, it would be even faster, but what I actually am
 doing is a matrix product, then adding a column vector to the result, then
 applying an elementwise logistic sigmoid function and potentially
 generating a matrix of pseudorandom numbers the same shape as my result
 (although not always).  When I do these sorts of workloads, my python
 numpy+GPU matrix class goes so much faster than anything that doesn't use
 the GPU (be it Matlab, or numpy, or C/C++ whatever) that I don't even
 bother measuring the speedups precisely.  In some cases, my python code
 isn't making too many temporaries since what it is doing is so simple, but
 in other cases that is obviously slowing it down a bit.  I have relatively
 complicated jobs that used to take weeks on the CPU can now take hours or
 days.

 Obviously improved memory usage would be more helpful since not everyone
 has access to the sorts of GPUs I use, but tenfold increases in performance
 seem like chump change compared to what I see with the sorts of workloads I
 do.

50-fold increases over NumPy+[Atlas|MKL] are really impressive.  However, the 
point is that these speed-ups can be achieved only when the ratio of 
operations per element is really huge.  Matrix-matrix multiplication (your 
example above) is a paradigmatic example of these scenarios, where 
computations scale as O(n^3) (or slightly less, when optimized algorithms are 
used), while memory access scales as O(n^2).  Of course, when the matrices 
are large, the operations/element ratio is larger, allowing much better 
speed-ups; this is why GPUs really do a good job here.

The point here is that matrix-matrix multiplications (or, in general, 
functions with a large operation/element ratio) are a *tiny* part of all the 
possible operations between arrays that NumPy supports.  This is why Sturla is 
saying that it is not a good idea to include support of GPUs in all parts of 
NumPy.  A much better strategy is to give NumPy the possibility to link with 
external packages (à la BLAS, LAPACK, Atlas, MKL) that can leverage the 
powerful GPUs for specific problems (e.g. matrix-matrix multiplications).
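
A back-of-envelope sketch of that ratio (illustrative numbers only):

n = 4096
matmul_flops = 2.0 * n**3          # ~2*n^3 multiply-adds for an n x n product
matmul_bytes = 3.0 * n**2 * 8      # read A and B, write C, in float64
elemwise_flops = 1.0 * n**2        # one multiply per element
elemwise_bytes = 3.0 * n**2 * 8    # same memory traffic as above

print("matmul   flops/byte: %.1f" % (matmul_flops / matmul_bytes))     # grows ~ n/12
print("elemwise flops/byte: %.3f" % (elemwise_flops / elemwise_bytes)) # fixed ~ 0.04

Only in the first case is there enough arithmetic per byte to hide slow memory
behind computation.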

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Francesc Alted
A Tuesday 08 September 2009 23:21:53 Christopher Barker escrigué:
 Also, perhaps a GPU-aware numexpr could be helpful which I think is the
 kind of thing that Sturla was refering to when she wrote:

 Incidentally,  this will  also make it easier to leverage on modern GPUs.

Numexpr mainly supports functions that are meant to be used element-wise, so 
the operation/element ratio is normally 1 (or close to 1).  It is in these 
scenarios where improved memory access is much more important than CPU (or, 
for that matter, GPU) speed, and is the reason why numexpr is much more 
efficient than NumPy when evaluating complex expressions like 
``a*b+c*sqrt(d)``.
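
For example (a minimal numexpr usage sketch; the whole expression is compiled
and evaluated block-wise, so no full-size temporaries for a*b, sqrt(d), etc.
are created):

import numpy as np
import numexpr as ne

n = 10**7
a, b, c, d = (np.random.rand(n) for _ in range(4))

r_numpy = a*b + c*np.sqrt(d)             # allocates several n-sized temporaries
r_ne = ne.evaluate("a*b + c*sqrt(d)")    # works on cache-sized blocks instead
assert np.allclose(r_numpy, r_ne)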

In other words, a GPU-enabled numexpr makes little sense.

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Lev Givon
Received from Francesc Alted on Wed, Sep 09, 2009 at 05:18:48AM EDT:

(snip)

 The point here is that matrix-matrix multiplications (or, in general, 
 functions with a large operation/element ratio) are a *tiny* part of all the 
 possible operations between arrays that NumPy supports.  This is why Sturla 
 is 
 saying that it is not a good idea to include support of GPUs in all parts of 
 NumPy.  A much better strategy is to give NumPy the possibility to link with 
 external packages (à la BLAS, LAPACK, Atlas, MKL) that can leverage the 

.. and CULA: http://www.culatools.com/

 powerful GPUs for specific problems (e.g. matrix-matrix multiplications).

L.G.


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Francesc Alted
A Wednesday 09 September 2009 11:26:06 Francesc Alted escrigué:
 A Tuesday 08 September 2009 23:21:53 Christopher Barker escrigué:
  Also, perhaps a GPU-aware numexpr could be helpful which I think is the
  kind of thing that Sturla was refering to when she wrote:
 
  Incidentally,  this will  also make it easier to leverage on modern
  GPUs.

 Numexpr mainly supports functions that are meant to be used element-wise,
 so the operation/element ratio is normally 1 (or close to 1).  In these
 scenarios is where improved memory access is much more important than CPU
 (or, for that matter, GPU), and is the reason why numexpr is much more
 efficient than NumPy when evaluating complex expressions like
 ``a*b+c*sqrt(d)``.

 In other words, a GPU-enabled numexpr makes little sense.

Er, I forgot the fact that one exception to the operation/element ratio being 
normally 1 in numexpr is the computation of transcendental functions 
(trigonometric, exponential, logarithmic...) where the number of CPU 
operations per element is much larger than 1 (normally in the 100s).  Right 
now, there is support for accelerating them in numexpr via VML (Intel's Vector 
Math Library), but I suppose that a library making use of a GPU would be very 
interesting too (and the same applies to numpy).

But again, it makes more sense to rely on external packages or libraries 
(similar to the VML above) for this sort of thing.  After having a look at 
CULA (thanks for the pointer, Lev!), my hope is that we will shortly see 
other libraries allowing for efficient evaluation of transcendental functions 
using GPUs too.

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread James Bergstra
On Wed, Sep 9, 2009 at 10:41 AM, Francesc Alted fal...@pytables.org wrote:
 Numexpr mainly supports functions that are meant to be used element-wise,
 so the operation/element ratio is normally 1 (or close to 1).  In these
 scenarios is where improved memory access is much more important than CPU
 (or, for that matter, GPU), and is the reason why numexpr is much more
 efficient than NumPy when evaluating complex expressions like
 ``a*b+c*sqrt(d)``.

 In other words, a GPU-enabled numexpr makes little sense.

There's another way of looking at this, which has been mentioned
before in the conversation, but which I think should be mentioned
again...

The cost of transfer to and from a GPU is very high, compared with
most of the sorts of things that we do with ndarrays.  So the approach
of using libraries to speed up little pieces here and there (e.g. with
VML or ATLAS) but basically to let stock numpy take care of the rest
does not work.  In order to benefit from huge speedups on a GPU, data
need to be on the GPU already.  It is a good idea to perform
low-instruction density functions on the GPU even when the CPU could
go just as fast (or even if the CPU is faster!) just to ensure that
the data stay on the GPU.

Suppose you want to evaluate dot(a*b+c*sqrt(d), e).  The GPU is
great for doing dot(), but if you have to copy the result of the
elemwise expression to the GPU before you can start doing dot(), then
the performance advantage is ruined.  Except for huge matrices, you
might as well just leave the data in the system RAM and use a normal
BLAS library.

So that's why it is a good idea to use the GPU to do some functions
even when the CPU would be faster for them (in isolation).

All that said, there is a possibility that future devices (and some
laptops already?) will use an integrated memory system that might make
'copying to the GPU' a non-issue... but we're not there yet I think...
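
As a sketch of the keep-it-on-the-card pattern (assuming PyCUDA and a
CUDA-capable device; the calls are illustrative, not a recommendation of a
particular toolkit):

import numpy as np
import pycuda.autoinit              # set up a CUDA context
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath

n = 1000000
a, b, c, d, e = (np.random.rand(n).astype(np.float32) for _ in range(5))

# One host->device transfer per operand...
ga, gb, gc, gd, ge = (gpuarray.to_gpu(x) for x in (a, b, c, d, e))

# ...then everything below stays in device memory.
tmp = ga * gb + gc * cumath.sqrt(gd)    # element-wise expression on the GPU
result = gpuarray.dot(tmp, ge).get()    # dot() on the GPU; .get() copies one scalar back
print(result)

The element-wise part may be no faster than the CPU at this size, but it avoids
shipping an n-element intermediate across the bus before the dot().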

James
-- 
http://www-etud.iro.umontreal.ca/~bergstrj
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Dag Sverre Seljebotn
Christopher Barker wrote:
 George Dahl wrote:
 Sturla Molden sturla at molden.no writes:
 Teraflops peak performance of modern GPUs is impressive. But NumPy 
 cannot easily benefit from that. 
 
 I know that for my work, I can get around an order of a 50-fold speedup over
 numpy using a python wrapper for a simple GPU matrix class.
 
 I think you're talking across each other here. Sturla is referring to 
 making a numpy ndarray gpu-aware and then expecting expressions like:
 
 z = a*x**2 + b*x + c
 
 to go faster when s, b, c, and x are ndarrays.
 
 That's not going to happen.
 
 On the other hand, George is talking about moving higher-level 
 operations (like a matrix product) over to GPU code. This is analogous 
 to numpy.linalg and numpy.dot() using LAPACK routines, and yes, that 
 could help those programs that use such operations.
 
 So a GPU LAPACK would be nice.
 
 This is also analogous to using SWIG, or ctypes or cython or weave, or 
 ??? to move a computationally expensive part of the code over to C.
 
 I think anything that makes it easier to write little bits of your code 
 for the GPU would be pretty cool -- a GPU-aware Cython?

Cython is probably open for that if anybody's interested in implementing 
it/make a student project on it (way too big for GSoC I think, 
unfortunately).

However I'd definitely make it a generic library turning expressions 
into compiled code (either GPU or CPU w/SSE); that could then be used 
both at compile-time from Cython, or at run-time using e.g. SymPy or 
SAGE expressions. Both PyCUDA and CorePy would tend to allow both 
compile-time operation and run-time operation.

-- 
Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Sturla Molden
George Dahl skrev:
  I know that for my work, I can get around an order of a 50-fold 
speedup over
  numpy using a python wrapper for a simple GPU matrix class.  So I 
might be
  dealing with a lot of matrix products where I multiply a fixed 512 by 
784 matrix
  by a 784 by 256 matrix that changes between each matrix product, 
although to
  really see the largest gains I use a 4096 by 2048 matrix times a 
bunch of 2048
  by 256 matrices.



Matrix multiplication is at the core of 3D graphics, and the raison 
d'etre for GPUs. That is specifically what they are designed to do. 
Matrix multiplication scales as O(n**3) in floating point operations and 
as O(n**2) in memory access. That is, GPUs give fast 3D graphics (matrix 
multiplications) by speeding up floating point operations.

GPUs make sense for certain level-3 BLAS calls, but that really belongs 
in BLAS, not in NumPy's core. One could e.g. consider linking with a 
BLAS wrapper that directs these special cases to the GPU and the rest to 
ATLAS / MKL / netlib BLAS.

Sturla Molden
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Sturla Molden
James Bergstra skrev:
 Suppose you want to evaluate dot(a*b+c*sqrt(d), e).  The GPU is
 great for doing dot(), 
The CPU is equally great (or better?) for doing dot(). In both cases:

- memory access scales O(n) for dot products.
- computation scales O(n) for dot products.
- memory is slow
- computation is fast (faster for GPU)

In both cases, the floating point unit is starved. That means it could 
do a lot more work if memory were faster.

For the GPU to be faster than CPU, you have to have a situation where 
computation dominates over memory access. Matrix-matrix multiplication 
is one such example. This is what GPUs are designed to do, as it is the 
major bottleneck in 3D graphics.

The proper way to speed up dot(a*b+c*sqrt(d), e) is to get rid of 
temporary intermediates. That is, in Python pseudo-code:

result = 0
for i in range(n):
    result += (a[i]*b[i] + c[i]*sqrt(d[i])) * e[i]

instead of:

tmp0 = empty(n)
for i in range(n):
    tmp0[i] = a[i] * b[i]

tmp1 = empty(n)
for i in range(n):
    tmp1[i] = sqrt(d[i])

tmp2 = empty(n)
for i in range(n):
    tmp2[i] = c[i] * tmp1[i]

tmp3 = empty(n)
for i in range(n):
    tmp3[i] = tmp0[i] + tmp2[i]

result = 0
for i in range(n):
    result += tmp3[i] * e[i]


It is this complication that makes NumPy an order of magnitude slower 
than hand-crafted C (but still much faster than pure Python!) Adding in 
GPUs will not change this. The amount of computation (flop count) is the 
same, so it cannot be the source of the slowness.


Sturla Molden



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread David Warde-Farley
On 10-Sep-09, at 12:47 AM, Sturla Molden wrote:

 The CPU is equally great (or better?) for doing dot(). In both cases:

 - memory access scale O(n) for dot producs.
 - computation scale O(n) for dot producs.
 - memory is low
 - computation is fast (faster for GPU)

You do realize that the throughput from onboard (video) RAM is going  
to be much higher, right? It's not just the parallelization but the  
memory bandwidth. And as James pointed out, if you can keep most of  
your intermediate computation on-card, you stand to benefit immensely,  
even if doing some operations where the GPU provides no tangible  
benefit (i.e. the benefit is in aggregate and avoiding copies).

FWIW I agree with you that NumPy isn't the place for GPU stuff to  
happen. In the short to medium term we need a way to make it simpler 
for naturally expressed computations not to go hog wild with temporary 
allocations (it's a very hard problem given the constraints of the 
language). In the long term I envision something with flexible enough  
machinery to be manipulating objects in GPU memory with the same ease  
as in main memory, but I think the path to that lies in increasing the  
generality and flexibility of the interfaces exposed.

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-09 Thread Fernando Perez
On Wed, Sep 9, 2009 at 9:47 PM, Sturla Moldenstu...@molden.no wrote:
 James Bergstra skrev:
 Suppose you want to evaluate dot(a*b+c*sqrt(d), e).  The GPU is
 great for doing dot(),
 The CPU is equally great (or better?) for doing dot(). In both cases:

 - memory access scale O(n) for dot producs.
 - computation scale O(n) for dot producs.

Remember that we have  a little terminology ambiguity here: in numpy,
dot(a,b) is used to describe both the vector dot product, an O(n)
operation if a and b are n-element vectors, and the matrix product, an
O(n**3) operation if a and b are both nxn square matrices.
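
For instance (a small illustration of the two meanings):

import numpy as np

v = np.random.rand(1000)
A = np.random.rand(500, 500)

s = np.dot(v, v)    # vector dot product: O(n) flops, returns a scalar
C = np.dot(A, A)    # matrix product: O(n**3) flops, returns an n x n matrix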

Just a clarification...

Cheers,

f
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-08 Thread Christopher Barker
George Dahl wrote:
 Sturla Molden sturla at molden.no writes:
 Teraflops peak performance of modern GPUs is impressive. But NumPy 
 cannot easily benefit from that. 

 I know that for my work, I can get around an order of a 50-fold speedup over
 numpy using a python wrapper for a simple GPU matrix class.

I think you're talking across each other here. Sturla is referring to 
making a numpy ndarray gpu-aware and then expecting expressions like:

z = a*x**2 + b*x + c

to go faster when a, b, c, and x are ndarrays.

That's not going to happen.

On the other hand, George is talking about moving higher-level 
operations (like a matrix product) over to GPU code. This is analogous 
to numpy.linalg and numpy.dot() using LAPACK routines, and yes, that 
could help those programs that use such operations.

So a GPU LAPACK would be nice.

This is also analogous to using SWIG, or ctypes or cython or weave, or 
??? to move a computationally expensive part of the code over to C.

I think anything that makes it easier to write little bits of your code 
for the GPU would be pretty cool -- a GPU-aware Cython?

Also, perhaps a GPU-aware numexpr could be helpful, which I think is the 
kind of thing that Sturla was referring to when she wrote:

Incidentally,  this will  also make it easier to leverage on modern GPUs.

-Chris











-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-08 Thread George Dahl
Sturla Molden sturla at molden.no writes:

 
 Erik Tollerud skrev:
  NumPy arrays on the GPU memory is an easy task. But then I would have to
  write the computation in OpenCL's dialect of C99? 
  This is true to some extent, but also probably difficult to do given
  the fact that paralellizable algorithms are generally more difficult
  to formulate in striaghtforward ways. 
 Then you have misunderstood me completely. Creating an ndarray that has 
 a buffer in graphics memory is not too difficult, given that graphics 
 memory can be memory mapped. This has nothing to do with parallelizable 
 algorithms or not. It is just memory management. We could make an 
 ndarray subclass that quickly puts is content in a buffer accessible to 
 the GPU. That is not difficult. But then comes the question of what you 
 do with it.
 
 I think many here misunderstands the issue here:
 
 Teraflops peak performance of modern GPUs is impressive. But NumPy 
 cannot easily benefit from that. In fact, there is little or nothing to 
 gain from optimising in that end. In order for a GPU to help, 
 computation must be the time-limiting factor. It is not. There is not 
 more to say about using GPUs in NumPy right now.
 
 Take a look at the timings here: http://www.scipy.org/PerformancePython 
 It shows that computing with NumPy is more than ten times slower than 
 using plain C. This is despite NumPy being written in C. The NumPy code 
 does not incur 10 times more floating point operations than the C code. 
 The floating point unit does not run in turtle mode when using NumPy. 
 NumPy's relative slowness compared to C has nothing to do with floating 
 point computation. It is due to inferior memory use (temporary buffers, 
 multiple buffer traversals) and memory access being slow. Moving 
 computation to the GPU can only make this worse.
 
 Improved memory usage - e.g. through lazy evaluation and JIT compilaton 
 of expressions - can give up to a tenfold increase in performance. That 
 is where we must start optimising to get a faster NumPy. Incidentally, 
 this will  also make it easier to leverage on modern GPUs.
 
 Sturla Molden
 


I know that for my work, I can get on the order of a 50-fold speedup over
numpy using a python wrapper for a simple GPU matrix class.  So I might be
dealing with a lot of matrix products where I multiply a fixed 512 by 784 matrix
by a 784 by 256 matrix that changes between each matrix product, although to
really see the largest gains I use a 4096 by 2048 matrix times a bunch of 2048
by 256 matrices.  If all I was doing were those matrix products, it would be
even faster, but what I actually am doing is a matrix product, then adding a
column vector to the result, then applying an elementwise logistic sigmoid
function and potentially generating a matrix of pseudorandom numbers the same
shape as my result (although not always).  When I do these sorts of workloads,
my python numpy+GPU matrix class goes so much faster than anything that doesn't
use the GPU (be it Matlab, or numpy, or C/C++ whatever) that I don't even bother
measuring the speedups precisely.  In some cases, my python code isn't making
too many temporaries since what it is doing is so simple, but in other cases
that is obviously slowing it down a bit.  I have relatively complicated jobs
that used to take weeks on the CPU and can now take hours or days.

Obviously improved memory usage would be more helpful since not everyone has
access to the sorts of GPUs I use, but tenfold increases in performance seem
like chump change compared to what I see with the sorts of workloads I do.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-09-02 Thread Romain Brette
Hi everyone,

In case anyone is interested, I just set up a google group to discuss 
GPU-based simulation for our Python neural simulator Brian:
http://groups.google.fr/group/brian-on-gpu
Our simulator relies heavily on Numpy. I would be very happy if the GPU 
experts here would like to share their expertise.

Best,
Romain

Romain Brette a écrit :
 Sturla Molden a écrit :
 Thus, here is my plan:

 1. a special context-manager class
 2. immutable arrays inside with statement
 3. lazy evaluation: expressions build up a parse tree
 4. dynamic code generation
 5. evaluation on exit

 
 There seems to be some similarity with what we want to do to accelerate 
 our neural simulations (briansimulator.org), as described here:
 http://brian.svn.sourceforge.net/viewvc/brian/trunk/dev/BEPs/BEP-9-Automatic%20code%20generation.txt?view=markup
 (by the way BEP is Brian Enhancement Proposal)
 The speed-up factor we got in our experimental code with GPU is very 
 substantial when there are many neurons (= large vectors, e.g. 10 000 
 elements), even when operations are simple.
 
 Romain

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-21 Thread Sturla Molden
Erik Tollerud skrev:
 NumPy arrays on the GPU memory is an easy task. But then I would have to
 write the computation in OpenCL's dialect of C99? 
 This is true to some extent, but also probably difficult to do given
 the fact that paralellizable algorithms are generally more difficult
 to formulate in striaghtforward ways. 
Then you have misunderstood me completely. Creating an ndarray that has 
a buffer in graphics memory is not too difficult, given that graphics 
memory can be memory mapped. This has nothing to do with parallelizable 
algorithms or not. It is just memory management. We could make an 
ndarray subclass that quickly puts its content in a buffer accessible to 
the GPU. That is not difficult. But then comes the question of what you 
do with it.

I think many here misunderstands the issue here:

Teraflops peak performance of modern GPUs is impressive. But NumPy 
cannot easily benefit from that. In fact, there is little or nothing to 
gain from optimising in that end. In order for a GPU to help, 
computation must be the time-limiting factor. It is not. There is not 
more to say about using GPUs in NumPy right now.

Take a look at the timings here: http://www.scipy.org/PerformancePython 
It shows that computing with NumPy is more than ten times slower than 
using plain C. This is despite NumPy being written in C. The NumPy code 
does not incur 10 times more floating point operations than the C code. 
The floating point unit does not run in turtle mode when using NumPy. 
NumPy's relative slowness compared to C has nothing to do with floating 
point computation. It is due to inferior memory use (temporary buffers, 
multiple buffer traversals) and memory access being slow. Moving 
computation to the GPU can only make this worse.

Improved memory usage - e.g. through lazy evaluation and JIT compilation 
of expressions - can give up to a tenfold increase in performance. That 
is where we must start optimising to get a faster NumPy. Incidentally, 
this will  also make it easier to leverage on modern GPUs.

Sturla Molden
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-20 Thread Erik Tollerud
I realize this topic is a bit old, but I couldn't help but add
something I forgot to mention earlier...

 I mean, once the computations are moved elsewhere numpy is basically a
 convenient way to address memory.

 That is how I mostly use NumPy, though. Computations I often do in
 Fortran 95 or C.

 NumPy arrays on the GPU memory is an easy task. But then I would have to
 write the computation in OpenCL's dialect of C99? But I'd rather program
 everything in Python if I could. Details like GPU and OpenCL should be
 hidden away. Nice looking Python with NumPy is much easier to read and
 write. That is why I'd like to see a code generator (i.e. JIT compiler)
 for NumPy.

This is true to some extent, but also probably difficult to do given
the fact that parallelizable algorithms are generally more difficult
to formulate in straightforward ways.  In the intermediate-term, I
think there is value in having numpy implement some sort of interface
to OpenCL or cuda - I can easily see an explosion of different
bindings (it's already starting), and having a canonical way encoded
in numpy or scipy is probably the best way to mitigate the inevitable
compatibility problems... I'm partial to the way pycuda can do it
(basically, just export numpy arrays to the GPU and let you write the
code from there), but the main point is to just get some basic
compatibility in pretty quickly, as I think this GPGPU is here to
stay...
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-07 Thread Romain Brette
Sturla Molden a écrit :
 Thus, here is my plan:
 
 1. a special context-manager class
 2. immutable arrays inside with statement
 3. lazy evaluation: expressions build up a parse tree
 4. dynamic code generation
 5. evaluation on exit
 

There seems to be some similarity with what we want to do to accelerate 
our neural simulations (briansimulator.org), as described here:
http://brian.svn.sourceforge.net/viewvc/brian/trunk/dev/BEPs/BEP-9-Automatic%20code%20generation.txt?view=markup
(by the way BEP is Brian Enhancement Proposal)
The speed-up factor we got in our experimental code with GPU is very 
substantial when there are many neurons (= large vectors, e.g. 10 000 
elements), even when operations are simple.

Romain

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Charles R Harris
On Thu, Aug 6, 2009 at 11:12 AM, James Bergstra
bergs...@iro.umontreal.cawrote:

 David Warde-Farley dwf at cs.toronto.edu writes:
  It did inspire some of our colleagues in Montreal to create this,
  though:
 
   http://code.google.com/p/cuda-ndarray/
 
  I gather it is VERY early in development, but I'm sure they'd love
  contributions!
 
 
 Hi David,
 That does look quite close to what I imagined, probably a good start then!
 Romain

 Hi, I'm one of the devs for that project.   Thanks David for the link.
  I put some text on the homepage so it's a little more
 self-explanatory.  We do welcome contributions.

 I feel like I must be reinventing the wheel on this, so I'd really
 appreciate it if someone who knows of a similar project would let me
 know about it. Otherwise we'll keep plugging away at replicating core
 ndarray interface elements (operators, math.h-type functions, array
 indexing, etc.)

 http://code.google.com/p/cuda-ndarray/


It almost looks like you are reimplementing numpy, in C++ no less. Is there
any reason why you aren't working with a numpy branch and just adding
ufuncs? I'm also curious if you have thoughts about how to use the GPU
pipelines in parallel.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread James Bergstra
On Thu, Aug 6, 2009 at 1:19 PM, Charles R
Harrischarlesr.har...@gmail.com wrote:
 I almost looks like you are reimplementing numpy, in c++ no less. Is there
 any reason why you aren't working with a numpy branch and just adding
 ufuncs?

I don't know how that would work.  The Ufuncs need a datatype to work
with, and AFAIK, it would break everything if a numpy ndarray pointed
to memory on the GPU.  Could you explain what you mean a little more?

 I'm also curious if you have thoughts about how to use the GPU
 pipelines in parallel.

Current thinking for ufunc type computations:
1) divide up the tensors into subtensors whose dimensions have
power-of-two sizes (this permits a fast integer -> ndarray coordinate
computation using bit shifting),
2) launch a kernel for each subtensor in its own stream to use
parallel pipelines.
3) sync and return.

This is a pain to do without automatic code generation though.
Currently we're using macros, but that's not pretty.
C++ has templates, which we don't really use yet, but we're planning on
using.  These have some power to generate code.
The 'theano' project (www.pylearn.org/theano) for which cuda-ndarray
was created has a more powerful code generation mechanism similar to
weave.   This algorithm is used in theano-cuda-ndarray.
Scipy.weave could be very useful for generating code for specific
shapes/ndims on demand, if weave could use nvcc.
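
As an aside on step 1, here is a pure-Python sketch of the power-of-two trick
(a hypothetical helper, not the cuda-ndarray code): a flat element index maps
to ndarray coordinates with shifts and masks instead of divisions and modulos.

def flat_to_coords(flat, shape):
    # shape must be all powers of two, e.g. (8, 4, 16); C order, last axis fastest
    coords = []
    for dim in reversed(shape):
        bits = dim.bit_length() - 1      # log2(dim)
        coords.append(flat & (dim - 1))  # flat % dim, done as a mask
        flat >>= bits                    # flat //= dim, done as a shift
    return tuple(reversed(coords))

print(flat_to_coords(37, (8, 4, 16)))    # (0, 2, 5), same as np.unravel_index(37, (8, 4, 16))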

James
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Erik Tollerud
Note that this is from a user perspective, as I have no particular plan of
developing the details of this implementation, but I've thought for a long
time that GPU support could be great for numpy (I would also vote for OpenCL
support over cuda, although conceptually they seem quite similar)...
But what exactly would the large-scale plan be?  One of the advantages of
GPGPUs is that they are particularly suited to rather complicated
parallelizable algorithms, and the numpy-level basic operations are just the
simple arithmetic operations.  So while I'd love to see it working, it's
unclear to me exactly how much is gained at the core numpy level, especially
given that it's limited to single-precision on most GPUs.

Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll
admit - especially if it's in the form of a drop-in replacement for the
numpy or scipy versions.

By the way, I noticed no one mentioned the GPUArray class in pycuda (and it
looks like there's something similar in the pyopencl) - seems like that's
already done a fair amount of the work...
http://documen.tician.de/pycuda/array.html#pycuda.gpuarray.GPUArray
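
For reference, the basic GPUArray pattern looks roughly like the sketch
below (from my reading of the pycuda docs, so treat the details as
approximate):

import numpy as np
import pycuda.autoinit              # sets up a default CUDA context
import pycuda.gpuarray as gpuarray

a = np.random.randn(4, 4).astype(np.float32)   # single precision for the GPU
a_gpu = gpuarray.to_gpu(a)                     # host -> device copy

b_gpu = 2*a_gpu + 1     # elementwise work runs on the card; result stays there
print(b_gpu.get())      # device -> host copy only when the result is needed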



On Thu, Aug 6, 2009 at 10:41 AM, James Bergstra
 bergs...@iro.umontreal.ca wrote:

 On Thu, Aug 6, 2009 at 1:19 PM, Charles R
 Harris charlesr.har...@gmail.com wrote:
  I almost looks like you are reimplementing numpy, in c++ no less. Is
 there
  any reason why you aren't working with a numpy branch and just adding
  ufuncs?

 I don't know how that would work.  The Ufuncs need a datatype to work
 with, and AFAIK, it would break everything if a numpy ndarray pointed
 to memory on the GPU.  Could you explain what you mean a little more?

  I'm also curious if you have thoughts about how to use the GPU
  pipelines in parallel.

 Current thinking for ufunc type computations:
 1) divide up the tensors into subtensors whose dimensions have
 power-of-two sizes (this permits a fast integer - ndarray coordinate
 computation using bit shifting),
 2) launch a kernel for each subtensor in it's own stream to use
 parallel pipelines.
 3) sync and return.

 This is a pain to do without automatic code generation though.
 Currently we're using macros, but that's not pretty.
 C++ has templates, which we don't really use yet, but were planning on
 using.  These have some power to generate code.
 The 'theano' project (www.pylearn.org/theano) for which cuda-ndarray
 was created has a more powerful code generation mechanism similar to
 weave.   This algorithm is used in theano-cuda-ndarray.
 Scipy.weave could be very useful for generating code for specific
 shapes/ndims on demand, if weave could use nvcc.

 James
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Matthieu Brucher
2009/8/6 Erik Tollerud erik.tolle...@gmail.com:
 Note that this is from a user perspective, as I have no particular plan of
 developing the details of this implementation, but I've thought for a long
 time that GPU support could be great for numpy (I would also vote for OpenCL
 support over cuda, although conceptually they seem quite similar)...
 But  what exactly would the large-scale plan be?  One of the advantages of
 GPGPUs is that they are particularly suited to rather complicated
 paralellizable algorithms,

You mean simple parallelizable algorithms, I suppose?

 and the numpy-level basic operations are just the
 simple arithmatic operations.  So while I'd love to see it working, it's
 unclear to me exactly how much is gained at the core numpy level, especially
 given that it's limited to single-precision on most GPUs.
 Now linear algebra or FFTs on a GPU would probably be a huge boon, I'll
 admit - especially if it's in the form of a drop-in replacement for the
 numpy or scipy versions.
 By the way, I noticed no one mentioned the GPUArray class in pycuda (and it
 looks like there's something similar in the pyopencl) - seems like that's
 already done a fair amount of the work...
 http://documen.tician.de/pycuda/array.html#pycuda.gpuarray.GPUArray


 On Thu, Aug 6, 2009 at 10:41 AM, James Bergstra bergs...@iro.umontreal.ca
 wrote:

 On Thu, Aug 6, 2009 at 1:19 PM, Charles R
 Harris charlesr.har...@gmail.com wrote:
  I almost looks like you are reimplementing numpy, in c++ no less. Is
  there
  any reason why you aren't working with a numpy branch and just adding
  ufuncs?

 I don't know how that would work.  The Ufuncs need a datatype to work
 with, and AFAIK, it would break everything if a numpy ndarray pointed
 to memory on the GPU.  Could you explain what you mean a little more?

  I'm also curious if you have thoughts about how to use the GPU
  pipelines in parallel.

 Current thinking for ufunc type computations:
 1) divide up the tensors into subtensors whose dimensions have
 power-of-two sizes (this permits a fast integer - ndarray coordinate
 computation using bit shifting),
 2) launch a kernel for each subtensor in it's own stream to use
 parallel pipelines.
 3) sync and return.

 This is a pain to do without automatic code generation though.
 Currently we're using macros, but that's not pretty.
 C++ has templates, which we don't really use yet, but were planning on
 using.  These have some power to generate code.
 The 'theano' project (www.pylearn.org/theano) for which cuda-ndarray
 was created has a more powerful code generation mechanism similar to
 weave.   This algorithm is used in theano-cuda-ndarray.
 Scipy.weave could be very useful for generating code for specific
 shapes/ndims on demand, if weave could use nvcc.

 James
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion





-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread David Warde-Farley
On 6-Aug-09, at 2:54 PM, Erik Tollerud wrote:

 Now linear algebra or FFTs on a GPU would probably be a huge boon,  
 I'll
 admit - especially if it's in the form of a drop-in replacement for  
 the
 numpy or scipy versions.

The word I'm hearing from people in my direct acquaintance who are  
using it is that if you have code that just does lots of matrix  
multiplies, never mind solving systems or anything like that, the  
speedup is several orders of magnitude. Things that used to take weeks  
now take a day or two. If you can deal with the loss of precision it's  
really quite worth it.

 By the way, I noticed no one mentioned the GPUArray class in pycuda  
 (and it
 looks like there's something similar in the pyopencl) - seems like  
 that's
 already done a fair amount of the work...
 http://documen.tician.de/pycuda/array.html#pycuda.gpuarray.GPUArray


This seems like a great start, I agree. The lack of any documentation  
on 'dot' is worrying, though.

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Sturla Molden

 Now linear algebra or FFTs on a GPU would probably be a huge boon, 
 I'll admit - especially if it's in the form of a drop-in replacement 
 for the numpy or scipy versions.


NumPy generates temporary arrays for expressions involving ndarrays. This 
extra allocation and copying often takes more time than the computation 
itself. With GPGPUs, we have to bus the data to and from VRAM as well. D. 
Knuth quoted Hoare saying that premature optimization is the root of all 
evil. Optimizing computation when the bottleneck is memory is premature.

In order to improve on this, I think we have to add lazy evaluation to 
NumPy. That is, an operator should not return a temporary array but a 
symbolic expression. So if we have an expression like

y = a*x + b

it should not evaluate a*x into a temporary array. Rather, the operators 
would build up a parse tree like

y = add(multiply(a,x),b)

and evaluate the whole expression later on.
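
A toy pure-Python sketch of what I mean (only to make the parse-tree
idea concrete, not a proposed implementation):

import numpy as np

class Expr(object):
    # Operators build tree nodes instead of computing anything.
    def __add__(self, other):
        return BinOp(np.add, self, other)
    def __mul__(self, other):
        return BinOp(np.multiply, self, other)

class Leaf(Expr):
    def __init__(self, value):
        self.value = np.asarray(value)
    def eval(self):
        return self.value

class BinOp(Expr):
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs
    def eval(self):
        # A real backend would hand the whole tree to a code generator;
        # here we just walk it with NumPy to show the deferred semantics.
        return self.op(self.lhs.eval(), self.rhs.eval())

a, x, b = Leaf(2.0), Leaf(np.arange(4.0)), Leaf(1.0)
y = a*x + b          # builds add(multiply(a, x), b); nothing is computed yet
print(y.eval())      # [ 1.  3.  5.  7.]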

This would require two things: First, we need dynamic code generation, 
which incidentally is what OpenCL is all about. I.e. OpenCL is a 
dynamically invoked compiler; there is a function 
clCreateProgramWithSource, which does just what it says. Second, we 
need arrays to be immutable. This is very important. If arrays are not 
immutable, code like this could fail:

y = a*x + b
x[0] = 1235512371235

With lazy evaluation, the memory overhead would be much smaller. The 
GPGPU would also get more complex expressions to use as kernels.
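
As a small sketch of that runtime-compilation step through pyopencl
(which wraps clCreateProgramWithSource behind cl.Program; the kernel is
just the y = a*x + b example above, and the details are from memory, so
treat them as approximate):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Kernel source built as a string at runtime and compiled on the fly.
src = """
__kernel void axpb(__global const float *a, __global const float *x,
                   __global const float *b, __global float *y)
{
    int i = get_global_id(0);
    y[i] = a[i]*x[i] + b[i];
}
"""
prg = cl.Program(ctx, src).build()

a, x, b = (np.random.rand(1024).astype(np.float32) for _ in range(3))
mf = cl.mem_flags
a_g, x_g, b_g = (cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h)
                 for h in (a, x, b))
y_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg.axpb(queue, a.shape, None, a_g, x_g, b_g, y_g)  # one work item per element

y = np.empty_like(a)
cl.enqueue_copy(queue, y, y_g)                      # read the result back
assert np.allclose(y, a*x + b)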

There should be an option of running this on the CPU, possibly using 
OpenMP for multi-threading. We could either depend on a compiler (C or 
Fortran) being installed, or use opcodes for a dedicated virtual machine 
(cf. what numexpr does).
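
numexpr already shows the pay-off of this approach on the CPU: the whole
expression is compiled to opcodes for its small VM and evaluated block
by block, so the a*x temporary never materializes (minimal sketch,
assuming numexpr is installed):

import numpy as np
import numexpr as ne

n = 10**7
a, x, b = np.random.rand(n), np.random.rand(n), np.random.rand(n)

y1 = a*x + b                  # plain NumPy: full-size temporary for a*x
y2 = ne.evaluate("a*x + b")   # single blockwise pass, no temporary array

assert np.allclose(y1, y2)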

In order to limit the inconvenience of immutable arrays, we could introduce 
a context manager. Inside the with statement, all arrays would be 
immutable, and the __exit__ method could then trigger the code generator 
and do all the evaluation. So we would get something like this:

# normal numpy here

with numpy.accelerator():

# arrays become immutable
# lazy evaluation
   
# code generation and evaluation on exit

# normal numpy continues here


Thus, here is my plan:

1. a special context-manager class
2. immutable arrays inside with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit

I guess it is possible to find ways to speed this up as well. If a 
context manager would always generate the same OpenCL code, the with 
statement would only need to execute once (we could raise an exception 
on enter to jump directly to exit).

It is possible to create a superfast NumPy. But just plugging GPGPUs 
into the current design would be premature. In NumPy's current state, 
with mutable ndarrays and operators generating temporary arrays, there 
is not much to gain from introducing GPGPUs. It would only be beneficial 
in computationally demanding parts like FFTs and solvers for linear 
algebra and differential equations. Ufuncs with transcendental functions 
might also benefit. SciPy would certainly benefit more from GPGPUs than 
NumPy.

Just my five cents :-)

Regards,
Sturla Molden
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Robert Kern
On Thu, Aug 6, 2009 at 15:57, Sturla Molden stu...@molden.no wrote:

 Now linear algebra or FFTs on a GPU would probably be a huge boon,
 I'll admit - especially if it's in the form of a drop-in replacement
 for the numpy or scipy versions.

 NumPy generate temporary arrays for expressions involving ndarrays. This
 extra allocation and copying often takes more time than the computation.
 With GPGPUs, we have to bus the data to and from VRAM as well. D. Knuth
 quoted Hoare saying that premature optimization is the root of all
 evil. Optimizing computation when the bottleneck is memory is premature.

 It is possibly to create a superfast NumPy. But just plugging GPGPUs
 into the current design would be premature. In NumPy's current state,
 with mutable ndarrays and operators generating temporary arrays, there
 is not much to gain from introducing GPGPUs. It would only be beneficial
 in computationally demanding parts like FFTs and solvers for linear
 algebra and differential equations.

I believe that is exactly the point that Erik is making. :-)

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Sturla Molden
Robert Kern wrote:
 I believe that is exactly the point that Erik is making. :-)
   
I wasn't arguing against him, just suggesting a solution. :-)

I have big hopes for lazy evaluation, if we can find a way to do it right.

Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread James Bergstra
On Thu, Aug 6, 2009 at 4:57 PM, Sturla Molden stu...@molden.no wrote:

 Now linear algebra or FFTs on a GPU would probably be a huge boon,
 I'll admit - especially if it's in the form of a drop-in replacement
 for the numpy or scipy versions.


 NumPy generate temporary arrays for expressions involving ndarrays. This
 extra allocation and copying often takes more time than the computation.
 With GPGPUs, we have to bus the data to and from VRAM as well. D. Knuth
 quoted Hoare saying that premature optimization is the root of all
 evil. Optimizing computation when the bottleneck is memory is premature.

 In order to improve on this, I think we have to add lazy evaluation to
 NumPy. That is, an operator should not return a temporary array but a
 symbolic expression. So if we have an expression like

    y = a*x + b

 it should not evalute a*x into a temporary array. Rather, the operators
 would build up a parse tree like

    y = add(multiply(a,x),b)

 and evalute the whole expression  later on.
[snip]
 Regards,
 Sturla Molden
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


Hi Sturla,

The plan you describe is a good one, and Theano
(www.pylearn.org/theano) almost exactly implements it.  You should
check it out.  It does not use 'with' syntax at the moment, but it
could provide the backend machinery for your mechanism if you want to
go forward with that.  Theano provides
- symbolic expression building for a big subset of what numpy can do
(and a few things that it doesn't)
- expression optimization (for faster and more accurate computations)
- dynamic code generation
- caching of compiled functions to disk.

Also, when you have a symbolic expression graph you can do cute stuff
like automatic differentiation.  We're currently working on the bridge
between theano and cuda so that you can declare certain inputs as residing
on the GPU instead of in host memory, so you don't have to transfer
things to and from host memory as much.
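
For the curious, here is the flavour of the symbolic-graph plus autodiff
part, as a minimal sketch against Theano's Python API (the numbers are
purely illustrative):

import theano
import theano.tensor as T

a, x, b = T.vector('a'), T.vector('x'), T.vector('b')

y = a*x + b                     # symbolic graph; nothing is computed yet
gy = T.grad(y.sum(), x)         # automatic differentiation through the graph

f = theano.function([a, x, b], [y, gy])   # optimization + code generation here
out, grad = f([1., 2.], [3., 4.], [5., 6.])
print(out)    # [  8.  14.]
print(grad)   # [ 1.  2.]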

James
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Charles R Harris
On Thu, Aug 6, 2009 at 3:29 PM, James Bergstra bergs...@iro.umontreal.ca wrote:

 On Thu, Aug 6, 2009 at 4:57 PM, Sturla Molden stu...@molden.no wrote:
 
  Now linear algebra or FFTs on a GPU would probably be a huge boon,
  I'll admit - especially if it's in the form of a drop-in replacement
  for the numpy or scipy versions.
 
 
  NumPy generate temporary arrays for expressions involving ndarrays. This
  extra allocation and copying often takes more time than the computation.
  With GPGPUs, we have to bus the data to and from VRAM as well. D. Knuth
  quoted Hoare saying that premature optimization is the root of all
  evil. Optimizing computation when the bottleneck is memory is premature.
 
  In order to improve on this, I think we have to add lazy evaluation to
  NumPy. That is, an operator should not return a temporary array but a
  symbolic expression. So if we have an expression like
 
 y = a*x + b
 
  it should not evalute a*x into a temporary array. Rather, the operators
  would build up a parse tree like
 
 y = add(multiply(a,x),b)
 
  and evalute the whole expression  later on.
 [snip]
  Regards,
  Sturla Molden
  ___
  NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
  http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

 Hi Sturla,

 The plan you describe is a good one, and Theano
 (www.pylearn.org/theano) almost exactly implements it.  You should
 check it out.  It does not use 'with' syntax at the moment, but it
 could provide the backend machinery for your mechanism if you want to
 go forward with that.  Theano provides
 - symbolic expression building for a big subset of what numpy can do
 (and a few things that it doesn't)
 - expression optimization (for faster and more accurate computations)
 - dynamic code generation
 - cacheing of compiled functions to disk.

 Also, when you have a symbolic expression graph you can do cute stuff
 like automatic differentiation.  We're currently working on the bridge
 between theano and cuda so that you declare certain inputs as residing
 on the GPU instead of the host memory, so you don't have to transfer
 things to and from host memory as much.


So what simple things could numpy implement that would help here? It almost
sounds like numpy would mostly be an interface to python and the gpu would
execute specialized code written and compiled for specific problems. Whether
the code that gets compiled is written using lazy evaluation (ala Sturla),
or is expressed some other way seems like an independent issue. It sounds
like one important thing would be having arrays that reside on the GPU.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Sturla Molden
Charles R Harris wrote:
 Whether the code that gets compiled is written using lazy evaluation 
 (ala Sturla), or is expressed some other way seems like an independent 
 issue. It sounds like one important thing would be having arrays that 
 reside on the GPU.
Memory management is slow compared to computation. Operations like 
malloc, free and memcpy are not faster for VRAM than for RAM. There will 
be no benefit from the GPU if the bottleneck is memory. That is why we 
need to get rid of the creation of temporary arrays, hence lazy evaluation.

Having arrays reside in VRAM would reduce the communication between RAM 
and VRAM, but the problem with temporary arrays is still there.

Also VRAM tends to be a limited resource.

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Sturla Molden
Sturla Molden wrote:
 Memory management is slow compared to computation. Operations like 
 malloc, free and memcpy is not faster for VRAM than for RAM. 

Actually it's not VRAM anymore, but whatever you call the memory 
dedicated to the GPU.

It is cheap to put 8 GB of RAM into a computer, but graphics cards with 
more than 1 GB memory are expensive and uncommon on e.g. laptops. And 
this memory will be needed for other things as well, e.g. graphics.

Sturla



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Charles R Harris
On Thu, Aug 6, 2009 at 4:36 PM, Sturla Molden stu...@molden.no wrote:

 Charles R Harris wrote:
  Whether the code that gets compiled is written using lazy evaluation
  (ala Sturla), or is expressed some other way seems like an independent
  issue. It sounds like one important thing would be having arrays that
  reside on the GPU.
 Memory management is slow compared to computation. Operations like
 malloc, free and memcpy is not faster for VRAM than for RAM. There will
 be no benefit from the GPU if the bottleneck is memory. That is why we
 need to get rid of the creation of temporary arrays, hence lazy evaluation.

 Having arrays reside in VRAM would reduce the communication between RAM
 and VRAM, but the problem with temporary arrays is still there.


I'm not arguing with that, but I regard it as a separate problem. One could,
after all, simply use an expression-to-GPU compiler to generate modules. The
question is what simple additions we can make to numpy so that it acts as a
convenient I/O channel. I mean, once the computations are moved elsewhere
numpy is basically a convenient way to address memory.



 Also VRAM tends to be a limited resource.


But getting less so. These days it comes in gigabytes and there is no reason
why it shouldn't soon exceed what many folks have for main memory.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Sturla Molden
Charles R Harris wrote:

 I mean, once the computations are moved elsewhere numpy is basically a 
 convenient way to address memory.

That is how I mostly use NumPy, though. Computations I often do in 
Fortran 95 or C.

Putting NumPy arrays in GPU memory is an easy task. But then I would have to 
write the computation in OpenCL's dialect of C99? I'd rather program 
everything in Python if I could. Details like GPU and OpenCL should be 
hidden away. Nice looking Python with NumPy is much easier to read and 
write. That is why I'd like to see a code generator (i.e. JIT compiler) 
for NumPy.


Sturla


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Sturla Molden
James Bergstra wrote:
 The plan you describe is a good one, and Theano
 (www.pylearn.org/theano) almost exactly implements it.  You should
 check it out.  It does not use 'with' syntax at the moment, but it
 could provide the backend machinery for your mechanism if you want to
 go forward with that.  Theano provides
 - symbolic expression building for a big subset of what numpy can do
 (and a few things that it doesn't)
 - expression optimization (for faster and more accurate computations)
 - dynamic code generation
 - cacheing of compiled functions to disk.
Thank you James, theano looks great. :-D

Sturla


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Charles R Harris
On Thu, Aug 6, 2009 at 5:10 PM, Sturla Molden stu...@molden.no wrote:

 Charles R Harris wrote:

  I mean, once the computations are moved elsewhere numpy is basically a
  convenient way to address memory.

 That is how I mostly use NumPy, though. Computations I often do in
 Fortran 95 or C.

 NumPy arrays on the GPU memory is an easy task.


Glad to hear it. So maybe some way to specify and track where the memory is
allocated would be helpful. Travis wants to add a dictionary to ndarrays and
that might be useful here.

But then I would have to
 write the computation in OpenCL's dialect of C99? But I'd rather program
 everything in Python if I could. Details like GPU and OpenCL should be
 hidden away. Nice looking Python with NumPy is much easier to read and
 write. That is why I'd like to see a code generator (i.e. JIT compiler)
 for NumPy.


Yes, but that is a language/compiler problem. I'm thinking of what tools
numpy can offer that would help people experimenting with different
approaches to using GPUs. At some point we might want to adopt a working
approach but now seems a bit early for that.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Fernando Perez
On Thu, Aug 6, 2009 at 1:57 PM, Sturla Molden stu...@molden.no wrote:
 In order to reduce the effect of immutable arrays, we could introduce a
 context-manager. Inside the with statement, all arrays would be
 immutable. Second, the __exit__ method could trigger the code generator
 and do all the evaluation. So we would get something like this:

    # normal numpy here

    with numpy.accelerator():

        # arrays become immutable
        # lazy evaluation

        # code generation and evaluation on exit

    # normal numpy continues here


 Thus, here is my plan:

 1. a special context-manager class
 2. immutable arrays inside with statement
 3. lazy evaluation: expressions build up a parse tree
 4. dynamic code generation
 5. evaluation on exit

You will face one issue here: unless you raise a special exception
inside the with block, the python interpreter will unconditionally
execute that code without your control.  I had a long talk about this
with Alex Martelli last year at scipy, where I pitched the idea of
allowing context managers to have an optional third method,
__execute__, which would get the code block in the with statement for
execution.  He was fairly pessimistic about the possibility of this
making its way into python, mostly (if I recall correctly) because of
scoping issues: the with statement does not introduce a new scope, so
you'd need to pass to this method the code plus the locals/globals of
the entire enclosing scope, which felt messy.
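
Just to spell out the "special exception" escape hatch: the closest
working trick I know of raises a sentinel at the top of the block and
has __exit__ swallow it. A toy sketch with made-up names:

class SkipBlock(Exception):
    pass

class cached_block(object):
    # Hypothetical: skip re-executing the block once compiled code is cached.
    def __init__(self, already_compiled):
        self.already_compiled = already_compiled
    def __enter__(self):
        return self
    def skip_if_cached(self):
        if self.already_compiled:
            raise SkipBlock()
    def __exit__(self, exc_type, exc_value, tb):
        return exc_type is SkipBlock   # returning True swallows the sentinel

with cached_block(already_compiled=True) as acc:
    acc.skip_if_cached()          # jumps straight to __exit__
    print('tracing the block')    # never reached when cached

Anything not guarded by the sentinel still runs unconditionally.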

There was also the thorny question of how to pass the code block.
Source? Bytecode? What?  In many environments the source may not be
available.  Last year I wrote a gross hack to do this, which you can
find here:

http://bazaar.launchpad.net/~ipython-dev/ipython/0.10/annotate/head%3A/IPython/kernel/contexts.py

The idea is that it would be used by code like this (note, this
doesn't actually work right now):

def test_simple():

    # XXX - for now, we need a running cluster to be started separately.  The
    # daemon work is almost finished, and will make much of this unnecessary.
    from IPython.kernel import client
    mec = client.MultiEngineClient(('127.0.0.1',10105))

    try:
        mec.get_ids()
    except ConnectionRefusedError:
        import os, time
        os.system('ipcluster -n 2 ')
        time.sleep(2)
        mec = client.MultiEngineClient(('127.0.0.1',10105))

    mec.block = False

    parallel = RemoteMultiEngine(mec)

    mec.pushAll()

    with parallel as pr:
        # A comment
        remote()  # this means the code below only runs remotely
        print 'Hello remote world'
        x = range(10)
        # Comments are OK
# Even misindented.
        y = x+1

    print pr.x + pr.y

###

The problem with my approach is that I find it brittle and ugly enough
that I ultimately abandoned it.  I'd love to see if you find a proper
solution for this...

Cheers,

f
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: GPU Numpy

2009-08-06 Thread Robert Kern
On Thu, Aug 6, 2009 at 19:00, Fernando Perez fperez@gmail.com wrote:
 On Thu, Aug 6, 2009 at 1:57 PM, Sturla Molden stu...@molden.no wrote:
 In order to reduce the effect of immutable arrays, we could introduce a
 context-manager. Inside the with statement, all arrays would be
 immutable. Second, the __exit__ method could trigger the code generator
 and do all the evaluation. So we would get something like this:

    # normal numpy here

    with numpy.accelerator():

        # arrays become immutable
        # lazy evaluation

        # code generation and evaluation on exit

    # normal numpy continues here


 Thus, here is my plan:

 1. a special context-manager class
 2. immutable arrays inside with statement
 3. lazy evaluation: expressions build up a parse tree
 4. dynamic code generation
 5. evaluation on exit

 You will face one issue here: unless you raise a special exception
 inside the with block, the python interpreter will unconditionally
 execute that code without your control.  I had a long talk about this
 with Alex Martelli last year at scipy, where I pitched the idea of
 allowing context managers to have an optional third method,
 __execute__, which would get the code block in the with statement for
 execution.  He was fairly pessimistic about the possibility of this
 making its way into python, mostly (if I recall correctly) because of
 scoping issues: the with statement does not introduce a new scope, so
 you'd need to pass to this method the code plus the locals/globals of
 the entire enclosing scope, which felt messy.

Sometimes, I fantasize about writing a python4ply grammar that
repurposes the `` quotes to provide expression literals and ``` ```
triple quotes for multiline statement literals. They would be literals
for _ast abstract syntax trees.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion