On Tue, 2010-09-28 at 23:56 +0200, Tomasz Rybak wrote:
> On Tue, 2010-09-28 at 00:29 -0700, jmcarval wrote:
> > Thanks for your reply.
> > I've read the first thread you mention, that ends without a solution
> > http://pycuda.2962900.n2.nabble.com/PyCUDA-pycuda-test-failures-tp5320194p5320194.html
> > 
> > Maybe I'm making a huge mistake, but it does not seem to be a precision
> > issue.
> > The following code (a simplification of test_gpuarray) returns 30 from the
> > CPU and 14 from the GTX480, whether with integer, float32 or float64.
> > I don't get it. Can anybody explain to me what I'm doing wrong, please?
> > Thanks
> > 
> > import pycuda.autoinit
> > import numpy
> > import pycuda.gpuarray as gpuarray
> > from pycuda.curandom import rand as curand
> > 
> > a = numpy.array([1,2,3,4])#.astype(numpy.float32)
> > a_gpu = gpuarray.to_gpu(a)
> > b = a
> > b_gpu = gpuarray.to_gpu(b)
> > 
> > dot_ab = numpy.dot(a, b)
> > 
> > dot_ab_gpu = gpuarray.dot(a_gpu, b_gpu).get()
> > 
> > print "CPU dot product:", dot_ab
> > print "GPU dot product:", dot_ab_gpu
> > 
> > 
> 
> I have an idea that may help check whether the problem is with PyCUDA,
> the CUDA toolkit, or the driver.
> Can you force PyCUDA to generate sm_1x code instead of sm_20?
> I have found that the architecture is determined in line 190 of
> pycuda/compiler.py:
> arch = "sm_%d%d" % Context.get_device().compute_capability()
> Try changing it to
> arch = "sm_10"
> and so on, and check whether you still get the incorrect 14 in
> that case.
> 
> If there is a simpler way of changing the architecture for which
> PyCUDA generates code, feel free to use it and share that
> information.
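
A possibly simpler way, at least for a quick experiment, is the arch
keyword that SourceModule accepts. This is only a rough sketch (the
kernel and the sm_13 choice are made up for illustration, and the
ReductionKernel behind gpuarray.dot still picks its own arch unless
compiler.py is edited as described above); note also that a cubin built
for sm_1x may refuse to load on a Fermi card at all, in which case
editing compiler.py is the way to go:

import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

# Compile a trivial kernel for an explicitly chosen architecture
# (sm_13 here, purely as an example) instead of the autodetected one.
mod = SourceModule("""
__global__ void twice(float *a)
{
  a[threadIdx.x] = 2.0f * a[threadIdx.x];
}
""", arch="sm_13")

twice = mod.get_function("twice")
a = numpy.arange(16, dtype=numpy.float32)
twice(drv.InOut(a), block=(16, 1, 1))
print "result with forced arch:", a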

Unrelated to the previous analysis, but it might also be important.
While looking at the reduction kernels I noticed that
for most of the reduction (in the later phase) the threads are
synchronised to ensure that all values have been computed
and stored in shared memory (the sdata array) - lines 106-121.
Then the reduction of the last 64 values follows, but there is no
synchronisation in lines 123-132.

Can someone check whether the attached synchronize.diff helps?
I have read that Fermi has a more aggressive cache, and that
threads can be scheduled slightly differently (I do not remember
the details). Another change in Fermi is that the same memory
can be used either as L1 cache or as shared memory, and ReductionKernel
uses shared memory to store intermediate results of the reduction.

Maybe (just an intuition) more aggressive synchronisation can help here.
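
For checking, something along these lines should do (just a sketch; the
sizes and dtypes are arbitrary, the point is to exercise more than one
reduction block and compare against numpy):

import pycuda.autoinit
import numpy
import pycuda.gpuarray as gpuarray

# Compare GPU reductions against numpy on data large enough to need
# several blocks of the reduction kernel.  Small float32 rounding
# differences are expected; a gross mismatch (like 14 vs 30 above)
# would indicate the bug.
n = 10**6
a = numpy.random.randn(n).astype(numpy.float32)
b = numpy.random.randn(n).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

print "dot: cpu", numpy.dot(a, b), "gpu", gpuarray.dot(a_gpu, b_gpu).get()
print "sum: cpu", a.sum(), "gpu", gpuarray.sum(a_gpu).get()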

I have also just found, in section 1.2.2 of the Fermi compatibility guide,
that the compiler optimises much more aggressively; I am attaching the
patch volatile.diff (made according to the guide) - please check whether
it solves the problem. If it does, since it is based on the official
guide, I would propose using it instead of synchronize.diff.
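
To exercise the patched reduction code path more directly, ReductionKernel
(which gpuarray.dot uses under the hood) can also be called by hand; this
is roughly the dot-product example from the PyCUDA documentation:

import pycuda.autoinit
import numpy
import pycuda.gpuarray as gpuarray
from pycuda.reduction import ReductionKernel

# Dot product written directly as a ReductionKernel, following the
# documentation example; it goes through the code the patches touch.
krnl = ReductionKernel(numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="float *x, float *y")

a_gpu = gpuarray.to_gpu(numpy.arange(400, dtype=numpy.float32))
b_gpu = gpuarray.to_gpu(numpy.arange(400, dtype=numpy.float32))

print "ReductionKernel dot:", krnl(a_gpu, b_gpu).get()
print "numpy dot:", numpy.dot(a_gpu.get(), b_gpu.get())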

Regards and good night.

-- 
Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A  488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
diff --git a/pycuda/reduction.py b/pycuda/reduction.py
index 0a17508..92f48d1 100644
--- a/pycuda/reduction.py
+++ b/pycuda/reduction.py
@@ -123,11 +123,17 @@ def get_reduction_module(out_type, block_size,
           if (tid < 32) 
           {
             if (BLOCK_SIZE >= 64) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 32]);
+            __syncthreads(); 
             if (BLOCK_SIZE >= 32) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 16]);
+            __syncthreads(); 
             if (BLOCK_SIZE >= 16) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 8]);
+            __syncthreads(); 
             if (BLOCK_SIZE >= 8)  sdata[tid] = REDUCE(sdata[tid], sdata[tid + 4]);
+            __syncthreads(); 
             if (BLOCK_SIZE >= 4)  sdata[tid] = REDUCE(sdata[tid], sdata[tid + 2]);
+            __syncthreads(); 
             if (BLOCK_SIZE >= 2)  sdata[tid] = REDUCE(sdata[tid], sdata[tid + 1]);
+            __syncthreads(); 
           }
 
           if (tid == 0) out[blockIdx.x] = sdata[0];
diff --git a/pycuda/reduction.py b/pycuda/reduction.py
index 0a17508..0e1eb3c 100644
--- a/pycuda/reduction.py
+++ b/pycuda/reduction.py
@@ -122,12 +122,14 @@ def get_reduction_module(out_type, block_size,
 
           if (tid < 32) 
           {
-            if (BLOCK_SIZE >= 64) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 32]);
-            if (BLOCK_SIZE >= 32) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 16]);
-            if (BLOCK_SIZE >= 16) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 8]);
-            if (BLOCK_SIZE >= 8)  sdata[tid] = REDUCE(sdata[tid], sdata[tid + 4]);
-            if (BLOCK_SIZE >= 4)  sdata[tid] = REDUCE(sdata[tid], sdata[tid + 2]);
-            if (BLOCK_SIZE >= 2)  sdata[tid] = REDUCE(sdata[tid], sdata[tid + 1]);
+// Fermi compatibility guide 1.2.2
+            volatile out_type *smem = sdata;
+            if (BLOCK_SIZE >= 64) smem[tid] = REDUCE(smem[tid], smem[tid + 32]);
+            if (BLOCK_SIZE >= 32) smem[tid] = REDUCE(smem[tid], smem[tid + 16]);
+            if (BLOCK_SIZE >= 16) smem[tid] = REDUCE(smem[tid], smem[tid + 8]);
+            if (BLOCK_SIZE >= 8)  smem[tid] = REDUCE(smem[tid], smem[tid + 4]);
+            if (BLOCK_SIZE >= 4)  smem[tid] = REDUCE(smem[tid], smem[tid + 2]);
+            if (BLOCK_SIZE >= 2)  smem[tid] = REDUCE(smem[tid], smem[tid + 1]);
           }
 
           if (tid == 0) out[blockIdx.x] = sdata[0];

