On Mon, Oct 12, 2015 at 3:25 PM, Mani Chandra <[email protected]> wrote:
> Here is the code: http://github.com/afd-illinois/grim, branch: opencl.

Sorry if I am being obtuse, but I cannot find that source file in the repo
above. Can you give a direct link to the file?

> Now using
>
> Kernel Builder for OpenCL API - compiler command line, version 1.4.0.134
> Copyright (C) 2014 Intel Corporation. All rights reserved.
>
> manic@bh27:~/grim_opencl/grim> ioc64 -input=computeresidual.cl
> -bo='-DOPENCL' -device='cpu'
>
> No command specified, using 'build' as default
> Using build options: -DOPENCL
> Setting target instruction set architecture to: Default (Advanced Vector
> Extension (AVX))
> OpenCL Intel CPU device was found!
> Device name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
> Device version: OpenCL 1.2 (Build 44)
> Device vendor: Intel(R) Corporation
> Device profile: FULL_PROFILE
> Compilation started
> Compilation done
> Linking started
> Linking done
> Device build started
> Device build done
> Kernel <ComputeResidual> was successfully vectorized
> Done.
> Build succeeded!

It definitely says it vectorized, but what code did it generate? Can you
post the object file, since I do not have the compiler? I have seen that
message with really bad code before.

  Thanks,

    Matt

> Cheers,
> Mani
>
> On Mon, Oct 12, 2015 at 12:52 PM, Matthew Knepley <[email protected]>
> wrote:
>
>> On Mon, Oct 12, 2015 at 2:44 PM, Mani Chandra <[email protected]> wrote:
>>>
>>> On Mon, Oct 12, 2015 at 12:36 PM, Barry Smith <[email protected]>
>>> wrote:
>>>>
>>>> > On Oct 12, 2015, at 2:29 PM, Matthew Knepley <[email protected]>
>>>> > wrote:
>>>> >
>>>> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <[email protected]>
>>>> > wrote:
>>>> >
>>>> > Hi Karl,
>>>> >
>>>> > My motivation was to avoid duplicating code for the CPU and the GPU.
>>>> > This is important, considering that it takes a long time to test and
>>>> > make sure the code produces the right results.
>>>> >
>>>> > I guess I can add a switch in my code with something like:
>>>> >
>>>> > if (usingCPU) use VecGetArray()
>>>> > else if (usingGPU) use VecViennaCLGetArray()
>>>> >
>>>> > and then wrap the pointers that the above functions return in OpenCL
>>>> > buffers with the appropriate memory flags (CL_USE_HOST_PTR for the
>>>> > CPU and CL_ALLOC_.. for the GPU).
>>>> >
>>>> > Hopefully, this will avoid unnecessary data transfers.
>>>> >
>>>> > I do not understand this comment at all. This looks crazy to me. The
>>>> > whole point of having Vec is so that no one ever ever ever ever does
>>>> > anything like this. I saw nothing in the thread that would compel
>>>> > you to do this. What are you trying to accomplish with this switch?
>>>
>>> I'm trying to assemble the residual needed for SNES using an OpenCL
>>> kernel. The kernel operates on OpenCL buffers, which can live either on
>>> the CPU or on the GPU.
>>>
>>> I think it is useful to use OpenCL on the CPU, basically because of
>>> vectorization and vector data types. If I had to write the usual C
>>> code, I'd have to use all sorts of pragmas in icc to get the code to
>>> vectorize, and even then it's pretty hard.
>>
>> I would completely agree with you if I thought the compiler actually
>> vectorized that code. I do not think that is the case. Is there an
>> example where you get vectorized assembly?
>>
>>   Thanks,
>>
>>     Matt
>>
>>> Mani
>>>
>>>> Matt,
>>>>
>>>>    The current OpenCL code in PETSc is hardwired for GPU usage. So the
>>>> correct fix, I believe, is to add support to the VecViennaCL wrappers
>>>> for using either the GPU or the CPU.
>>>>
>>>>    Barry
>>>>
>>>> >
>>>> >    Matt
>>>> >
>>>> > Cheers,
>>>> > Mani
>>>> >
>>>> > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <[email protected]> wrote:
>>>> >
>>>> > Hi Mani,
>>>> >
>>>> > > Following
>>>> > > http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>>>> > > (page 16), I ran KSP ex12 for two cases:
>>>> > >
>>>> > > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>>> > >
>>>> > > real    0m0.213s
>>>> > > user    0m0.206s
>>>> > > sys     0m0.004s
>>>> > >
>>>> > > 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>>>> > > -log_summary > log_summary_with_viennacl
>>>> > >
>>>> > > real    0m20.296s
>>>> > > user    0m46.025s
>>>> > > sys     0m1.435s
>>>> > >
>>>> > > The runs were performed on a CPU (an AMD A10-5800K), with OpenCL
>>>> > > from AMD-APP-SDK-v3.0.
>>>> >
>>>> > There are a couple of things to note here:
>>>> >
>>>> > a) The total execution time contains the OpenCL kernel compilation
>>>> > time, which is on the order of one or two seconds. Thus, you need
>>>> > much larger problem sizes to get a good comparison.
>>>> >
>>>> > b) Most of the execution time is spent in VecMDot, which is
>>>> > optimized for GPUs (CPUs are not an optimization goal of ViennaCL's
>>>> > OpenCL backend, because there one can just use plain C/C++/whatever).
>>>> >
>>>> > c) My experiences with this AMD APU are quite mixed, as I've never
>>>> > found a way to get more than 45% of STREAM bandwidth with OpenCL on
>>>> > the CPU part. The integrated GPU, however, reached 80% without much
>>>> > effort. This is particularly remarkable as both the CPU and the GPU
>>>> > share the same DDR3 memory link. Thus, it is more than unlikely that
>>>> > you will ever beat the performance of PETSc's native types.
>>>> >
>>>> > > Attached are:
>>>> > > 1) configure.log for the petsc build
>>>> > > 2) log summary without viennacl
>>>> > > 3) log summary with viennacl
>>>> > > 4) OpenCL info for the system on which the runs were performed
>>>> > >
>>>> > > Perhaps the reason for the slow performance is superfluous copies
>>>> > > being performed, which need not occur when running ViennaCL on the
>>>> > > CPU. Looking at
>>>> > > http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
>>>> > >
>>>> > > /* Copies a vector from the CPU to the GPU unless we already have
>>>> > >    an up-to-date copy on the GPU */
>>>> > > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>>>> > > {
>>>> > >   PetscErrorCode ierr;
>>>> > >
>>>> > >   PetscFunctionBegin;
>>>> > >   ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>>> > >   if (v->map->n > 0) {
>>>> > >     if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>>> > >       ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>> > >       try {
>>>> > >         ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>>> > >         viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>>>> > >         ViennaCLWaitForGPU();
>>>> > >       } catch(std::exception const & ex) {
>>>> > >         SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>>>> > >       }
>>>> > >       ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>> > >       v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>>> > >     }
>>>> > >   }
>>>> > >   PetscFunctionReturn(0);
>>>> > > }
>>>> > >
>>>> > > When running ViennaCL with OpenCL on the CPU, should the above
>>>> > > function maybe be modified?
>>>> >
>>>> > Unfortunately that is quite hard: OpenCL manages its own memory
>>>> > handles, so 'injecting' memory into an OpenCL kernel that is not
>>>> > allocated by the OpenCL runtime is not recommended, fairly tricky,
>>>> > and still involves some overhead.
>>>> > As I see no reason to run OpenCL on a CPU, I refrained from adding
>>>> > this extra code complexity.
>>>> >
>>>> > Overall, I recommend rerunning the benchmark on more powerful
>>>> > discrete GPUs with GDDR5 (or on-chip memory). Otherwise you won't
>>>> > see any performance benefits.
>>>> >
>>>> > Hope this sheds some light on things :-)
>>>> >
>>>> > Best regards,
>>>> > Karli
>>>> >
>>>> > --
>>>> > What most experimenters take for granted before they begin their
>>>> > experiments is infinitely more interesting than any results to which
>>>> > their experiments lead.
>>>> > -- Norbert Wiener
