> On Oct 12, 2015, at 2:29 PM, Matthew Knepley <[email protected]> wrote:
> 
> On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <[email protected]> wrote:
> Hi Karl,
> 
> My motivation was to avoid duplicating code for the CPU and the GPU. This is 
> important considering that it takes a long time to test and make sure the 
> code produces the right results. 
> 
> I guess I can add a switch in my code with something like:
> 
> if (usingCPU) use VecGetArray()
> 
> else if (usingGPU) use VecViennaCLGetArray()
> 
> and then wrap the pointers that the above functions return with OpenCL 
> buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and 
> CL_ALLOC_.. for GPU)
> 
> Hopefully, this will avoid unnecessary data transfers.
> 
> I do not understand this comment at all. This looks crazy to me. The whole 
> point of having Vec is so that no one ever ever ever ever does anything like 
> this. I saw nothing in the thread that would compel you to do this. What are 
> you trying to accomplish with this switch?

  Matt,

     The current OpenCL code in PETSc is hardwired for GPU usage. So the 
correct fix, I believe, is to add to the VecViennaCL wrappers support for 
either using the GPU or the CPU.
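
     Roughly, the copy routine would skip the transfer when the OpenCL 
device is the CPU. A minimal standalone sketch of that idea, not PETSc code: 
the names SketchVec, Backend, Valid, and copy_to_device are made up for 
illustration, and it assumes the CPU device can work on the host array 
directly, which is exactly the part Karl describes below as tricky in real 
OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; none of these names exist in PETSc or ViennaCL.
enum class Backend { GPU, CPU };
enum class Valid { HostOnly, DeviceOnly, Both };

struct SketchVec {
  std::vector<double> host;    // plays the role of the PETSc host array
  std::vector<double> device;  // plays the role of the ViennaCL buffer
  Valid valid = Valid::HostOnly;
  std::size_t copies = 0;      // counts host-to-device transfers
};

// Same shape as VecViennaCLCopyToGPU: transfer only when the host holds the
// sole up-to-date data, and skip the transfer when the device is the CPU.
void copy_to_device(SketchVec &v, Backend backend) {
  if (v.host.empty() || v.valid != Valid::HostOnly) return;
  if (backend == Backend::GPU) {
    v.device = v.host;  // stands in for viennacl::fast_copy
    ++v.copies;
  }
  // Assumption: on the CPU backend the host array is usable as-is.
  v.valid = Valid::Both;
}
```

   With this, user code would keep calling one routine for both cases; only 
the wrapper would know which device is in use.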

   Barry

> 
>   Matt
>  
> Cheers,
> Mani
> 
> On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <[email protected]> wrote:
> Hi Mani,
> 
> > Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
> (page 16), I ran KSP ex12 for two cases:
> 
> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
> 
> real    0m0.213s
> user    0m0.206s
> sys     0m0.004s
> 
> 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl -log_summary > log_summary_with_viennacl
> 
> real    0m20.296s
> user    0m46.025s
> sys     0m1.435s
> 
> The runs were performed on a CPU (AMD A10-5800K) with OpenCL from
> AMD-APP-SDK-v3.0.
> 
> There are a couple of things to note here:
> 
> a) The total execution time contains the OpenCL kernel compilation time, 
> which is on the order of one or two seconds. Thus, you need much larger 
> problem sizes to get a good comparison.
> 
> b) Most of the execution time is spent on VecMDot, which is optimized for 
> GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend because 
> one can use just plain C/C++/whatever).
> 
> c) My experiences with this AMD APU are quite mixed, as I've never found a 
> way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part. The 
> integrated GPU, however, reached 80% without much effort. This is 
> particularly remarkable as both CPU and GPU share the same DDR3 memory link. 
> Thus, it is more than unlikely that you will ever beat the performance of 
> PETSc's native types.
> 
> 
> 
> Attached are:
> 1) configure.log for the petsc build
> 2) log summary without viennacl
> 3) log summary with viennacl
> 4) OpenCL info for the system on which the runs were performed
> 
> Perhaps the reason for the slow performance is superfluous copies being
> performed, which need not occur when running ViennaCL on the CPU.
> Looking at
> http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
> 
> /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */
> PetscErrorCode VecViennaCLCopyToGPU(Vec v)
> {
>    PetscErrorCode ierr;
> 
>    PetscFunctionBegin;
>    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>    if (v->map->n > 0) {
>      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>        try {
>          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>          ViennaCLWaitForGPU();
>        } catch(std::exception const & ex) {
>          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>        }
>        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>      }
>    }
>    PetscFunctionReturn(0);
> }
> 
> When running ViennaCL with OpenCL on the CPU, the above function should
> maybe be modified?
> 
> Unfortunately that is quite hard: OpenCL manages its own memory handles, so 
> 'injecting' memory into an OpenCL kernel that is not allocated by the OpenCL 
> runtime is not recommended, fairly tricky, and still involves some overhead. 
> As I see no reason to run OpenCL on a CPU, I refrained from adding this extra 
> code complexity.
> 
> Overall, I recommend rerunning the benchmark on more powerful discrete GPUs 
> with GDDR5 (or on-chip memory). Otherwise you won't see any performance 
> benefits.
> 
> Hope this sheds some light on things :-)
> 
> Best regards,
> Karli
> 
> 
> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
