On Mon, Oct 12, 2015 at 2:36 PM, Barry Smith <[email protected]> wrote:
> On Oct 12, 2015, at 2:29 PM, Matthew Knepley <[email protected]> wrote:
>
> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <[email protected]> wrote:
> > > Hi Karl,
> > >
> > > My motivation was to avoid duplicating code for the CPU and the GPU.
> > > This is important considering that it takes a long time to test and
> > > make sure the code produces the right results.
> > >
> > > I guess I can add a switch in my code with something like:
> > >
> > >   if (usingCPU)      use VecGetArray()
> > >   else if (usingGPU) use VecViennaCLGetArray()
> > >
> > > and then wrap the pointers that the above functions return with OpenCL
> > > buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for the
> > > CPU and CL_ALLOC_.. for the GPU).
> > >
> > > Hopefully, this will avoid unnecessary data transfers.
> >
> > I do not understand this comment at all. This looks crazy to me. The
> > whole point of having Vec is so that no one ever ever ever ever does
> > anything like this. I saw nothing in the thread that would compel you
> > to do this. What are you trying to accomplish with this switch?
>
> Matt,
>
> The current OpenCL code in PETSc is hardwired for GPU usage. So the
> correct fix, I believe, is to add to the VecViennaCL wrappers support
> for either using the GPU or the CPU.

Yes, that is an option. I thought the upshot of Karl's mail was that while
this is possible, OpenCL CPU performance is woeful and unlikely to improve,
and a better option is to use the current code with multiple MPI processes
and the PETSc type mechanism.
  Matt

> Barry
>
> > Matt
> >
> > > Cheers,
> > > Mani
> > >
> > > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <[email protected]> wrote:
> > > > Hi Mani,
> > > >
> > > > > Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
> > > > > (page 16), I ran KSP ex12 for two cases:
> > > > >
> > > > > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
> > > > >
> > > > > real    0m0.213s
> > > > > user    0m0.206s
> > > > > sys     0m0.004s
> > > > >
> > > > > 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
> > > > >    -log_summary > log_summary_with_viennacl
> > > > >
> > > > > real    0m20.296s
> > > > > user    0m46.025s
> > > > > sys     0m1.435s
> > > > >
> > > > > The runs have been performed on a CPU: AMD A10-5800K, with OpenCL
> > > > > from AMD-APP-SDK-v3.0.
> > > >
> > > > There are a couple of things to note here:
> > > >
> > > > a) The total execution time contains the OpenCL kernel compilation
> > > > time, which is on the order of one or two seconds. Thus, you need much
> > > > larger problem sizes to get a good comparison.
> > > >
> > > > b) Most of the execution time is spent in VecMDot, which is optimized
> > > > for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL
> > > > backend, because one can just use plain C/C++/whatever).
> > > >
> > > > c) My experiences with this AMD APU are quite mixed, as I've never
> > > > found a way to get more than 45% of STREAM bandwidth with OpenCL on
> > > > the CPU part. The integrated GPU, however, reached 80% without much
> > > > effort. This is particularly remarkable as both CPU and GPU share the
> > > > same DDR3 memory link. Thus, it is more than unlikely that you will
> > > > ever beat the performance of PETSc's native types.
> > > >
> > > > > Attached are:
> > > > > 1) configure.log for the petsc build
> > > > > 2) log summary without viennacl
> > > > > 3) log summary with viennacl
> > > > > 4) OpenCL info for the system on which the runs were performed
> > > > >
> > > > > Perhaps the reason for the slow performance is superfluous copies
> > > > > being performed, which need not occur when running ViennaCL on the CPU.
> > > > > Looking at
> > > > > http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
> > > > >
> > > > > /* Copies a vector from the CPU to the GPU unless we already have an
> > > > >    up-to-date copy on the GPU */
> > > > > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
> > > > > {
> > > > >   PetscErrorCode ierr;
> > > > >
> > > > >   PetscFunctionBegin;
> > > > >   ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
> > > > >   if (v->map->n > 0) {
> > > > >     if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
> > > > >       ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
> > > > >       try {
> > > > >         ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
> > > > >         viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
> > > > >         ViennaCLWaitForGPU();
> > > > >       } catch(std::exception const & ex) {
> > > > >         SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
> > > > >       }
> > > > >       ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
> > > > >       v->valid_GPU_array = PETSC_VIENNACL_BOTH;
> > > > >     }
> > > > >   }
> > > > >   PetscFunctionReturn(0);
> > > > > }
> > > > >
> > > > > When running ViennaCL with OpenCL on the CPU, should the above
> > > > > function maybe be modified?
> > > >
> > > > Unfortunately, that is quite hard: OpenCL manages its own memory
> > > > handles, so 'injecting' memory into an OpenCL kernel that is not
> > > > allocated by the OpenCL runtime is not recommended, fairly tricky,
> > > > and still involves some overhead. As I see no reason to run OpenCL
> > > > on a CPU, I refrained from adding this extra code complexity.
> > > >
> > > > Overall, I recommend rerunning the benchmark on more powerful
> > > > discrete GPUs with GDDR5 (or on-chip memory). Otherwise you won't
> > > > see any performance benefits.
> > > >
> > > > Hope this sheds some light on things :-)
> > > >
> > > > Best regards,
> > > > Karli
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which
> > their experiments lead.
> > -- Norbert Wiener

--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener
