Dave,

I cannot explain why it does not use MatMult_SeqAIJCusp(); it does for me.
Have you updated to the latest cusp/thrust from the mercurial repositories?

There is a difference: in your new 4.0 build you added --download-txpetscgpu=yes. BTW, that doesn't work for me with the latest cusp and thrust from the mercurial repositories. Can you try reconfiguring and making without that?

Barry

On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:

> Barry Smith writes:
>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
>>
>>> Thanks for the update. I don't believe I have gotten a run with good
>>> performance yet, either from C or Fortran. I wish there was an easy way for
>>> me to force use of only one of my gpus. I don't want to have to pull one of
>>> the gpus in order to see if that is complicating things with Cuda 4.0. I'll
>>> be eager to hear if you make any progress on figuring things out.
>>>
>>> Do you understand yet why the petsc ex2.c example is able to parse the
>>> "-cuda_show_devices" argument but ex2f.F does not?
>>
>> Matt put the code in the wrong place in PETSc, that is all, no big
>> existentialist reason. I will fix that.
>
> Thanks. I'll look forward to testing out the new version.
>
> Dave
>
>> Barry
>>
>>>
>>> Thanks,
>>>
>>> Dave
>>>
>>> Barry Smith writes:
>>>> It is not doing the MatMult operation on the GPU and hence needs to move
>>>> the vectors back and forth for each operation (since MatMult is done on
>>>> the CPU with the vector while vector operations are done on the GPU) hence
>>>> the terrible performance.
>>>>
>>>> Not sure why yet. It is copying the Mat down for me from C.
>>>>
>>>> Barry
>>>>
>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
>>>>
>>>>> In case it might be useful, I have attached two log files of runs with the
>>>>> ex2f petsc example from src/ksp/ksp/examples/tutorials. One was run back in
>>>>> April with petsc-dev linked to Cuda 3.2. It shows excellent runtime
>>>>> performance.
>>>>> The other was run today with petsc-dev checked out of the
>>>>> mercurial repo yesterday morning and linked to Cuda 4.0. In addition to the
>>>>> differences in run time performance, I also do not see an entry for
>>>>> MatCUSPCopyTo in the profiling section. I'm not sure what the significance
>>>>> of that is. I do observe that the run time for PCApply is about the same for
>>>>> the two cases. I think I would expect that to be the case even if the
>>>>> problem were partitioned across two gpus. However, it does make me wonder if
>>>>> the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log
>>>>> file is an indication that the matrix was not actually copied to the gpu.
>>>>> I'm not sure yet how to check for that. Hope this might be useful.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
>>>>> Dave Nystrom writes:
>>>>>> Matthew Knepley writes:
>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
>>>>>>>> Barry Smith writes:
>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
>>>>>>>>>> Hi Barry,
>>>>>>>>>>
>>>>>>>>>> I've sent a couple more emails on this topic. What I am trying to do at the
>>>>>>>>>> moment is to figure out how to have a problem run on only one gpu if it will
>>>>>>>>>> fit in the memory of that gpu. Back in April when I had built petsc-dev with
>>>>>>>>>> Cuda 3.2, petsc would only use one gpu if you had multiple gpus on your
>>>>>>>>>> machine. In order to use multiple gpus for a problem, one had to use
>>>>>>>>>> multiple threads with a separate thread assigned to control each gpu. But
>>>>>>>>>> Cuda 4.0 has, I believe, made that transparent and under the hood.
>>>>>>>>>> So now when I run a small example problem such as
>>>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem, it gets
>>>>>>>>>> partitioned to run on both of the gpus in my machine. The result is a very
>>>>>>>>>> large performance hit because of communication back and forth from one gpu to
>>>>>>>>>> the other via the cpu.
>>>>>>>>>
>>>>>>>>> How do you know there is lots of communication from the GPU to the CPU? In
>>>>>>>>> the -log_summary? Nope because PETSc does not manage anything like that
>>>>>>>>> (that is one CPU process using both GPUs).
>>>>>>>>
>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by petsc.
>>>>>>>>
>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower
>>>>>>>>>> now than it did with Cuda 3.2. I believe if one is programming down at the
>>>>>>>>>> cuda level, it is possible to have a smaller problem run on only one gpu so
>>>>>>>>>> that there is communication only between the cpu and gpu and only at the
>>>>>>>>>> start and end of the calculation.
>>>>>>>>>>
>>>>>>>>>> To me, it seems like what is needed is a petsc option to specify the number
>>>>>>>>>> of gpus to run on that can somehow get passed down to the cuda level through
>>>>>>>>>> cusp and thrust. I fear that the short term solution is going to have to be
>>>>>>>>>> for me to pull one of the gpus out of my desktop system but it would be nice
>>>>>>>>>> if there was a way to tell petsc and friends to just use one gpu when I want
>>>>>>>>>> it to.
>>>>>>>>>>
>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate what I am
>>>>>>>>>> trying to describe regarding the performance hit.
>>>>>>>>>
>>>>>>>>> I am not convinced that the poor performance you are getting now has
>>>>>>>>> anything to do with using both GPUs.
>>>>>>>>> Please run a PETSc program with the
>>>>>>>>> command -cuda_show_devices
>>>>>>>>
>>>>>>>> I ran the following command:
>>>>>>>>
>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary -cuda_show_devices
>>>>>>>> -mat_type aijcusp -vec_type cusp -options_left
>>>>>>>>
>>>>>>>> The result was a report that there was one option left, that being
>>>>>>>> -cuda_show_devices. I am using a copy of petsc-dev that I cloned and built
>>>>>>>> this morning.
>>>>>>>
>>>>>>> What do you have at src/sys/object/pinit.c:825? You should see the code
>>>>>>> that processes this option. You should be able to break there in the
>>>>>>> debugger and see what happens. This sounds again like you are not
>>>>>>> processing options correctly.
>>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> I'll take a look at that in a bit and see if I can figure out what is going
>>>>>> on. I do see the code that you mention that processes the arguments that
>>>>>> Barry mentioned. In terms of processing options correctly, at least in this
>>>>>> case I am actually running one of the petsc examples rather than my own
>>>>>> code. And it seems to correctly process the other command line arguments.
>>>>>> Anyway, I'll write more after I have had a chance to investigate more.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>>> What are the choices? You can then pick one of them and run with
>>>>>>>>> -cuda_set_device integer
>>>>>>>>
>>>>>>>> The -cuda_set_device option does not appear to be recognized either, even
>>>>>>>> if I choose an integer like 0.
>>>>>>>>
>>>>>>>>> Does this change things?
>>>>>>>>
>>>>>>>> I suspect it would change things if I could get it to work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>> Barry Smith writes:
>>>>>>>>>>> Dave,
>>>>>>>>>>>
>>>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU process to
>>>>>>>>>>> use two GPUs at the same time. However you could have two MPI processes
>>>>>>>>>>> each using their own GPU.
>>>>>>>>>>>
>>>>>>>>>>> The one tricky part is you need to make sure each MPI process uses a
>>>>>>>>>>> different GPU. We currently do not have a mechanism to do this assignment
>>>>>>>>>>> automatically. I think it can be done with cudaSetDevice(). But I don't
>>>>>>>>>>> know the details, sending this to petsc-dev at mcs.anl.gov where more people
>>>>>>>>>>> may know.
>>>>>>>>>>>
>>>>>>>>>>> PETSc-folks,
>>>>>>>>>>>
>>>>>>>>>>> We need a way to have this setup automatically.
>>>>>>>>>>>
>>>>>>>>>>> Barry
>>>>>>>>>>>
>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus. This is a desktop
>>>>>>>>>>>> machine with a single processor. I know that Cuda 4.0 has support for
>>>>>>>>>>>> running on multiple gpus but don't know if petsc uses that. But suppose I
>>>>>>>>>>>> have a problem that will fit in the memory for a single gpu. Will petsc run
>>>>>>>>>>>> the problem on a single gpu or does it split it between the 2 gpus and incur
>>>>>>>>>>>> the communication overhead of copying data between the two gpus?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Dave
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their experiments
>>>>>>> is infinitely more interesting than any results to which their experiments
>>>>>>> lead.
>>>>>>> -- Norbert Wiener
>>>>
>>
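
For reference, a minimal sketch of the per-process device assignment raised at the bottom of the thread (each MPI process using a different GPU, possibly via cudaSetDevice()). This is an illustration only, not PETSc code. pick_device is a hypothetical helper, local_rank is assumed to be the rank of the process among those on the same node, and the round-robin policy is just one plausible choice that the thread does not specify. The CUDA runtime calls appear only in the comment so the mapping itself compiles without CUDA.

```c
#include <assert.h>

/* Hypothetical helper (not part of PETSc): map a node-local process rank
 * to a CUDA device index in round-robin fashion, so that processes on the
 * same node are spread across the available GPUs. */
int pick_device(int local_rank, int device_count)
{
    assert(local_rank >= 0);   /* ranks are non-negative */
    assert(device_count > 0);  /* at least one GPU must be present */
    return local_rank % device_count;
}

/*
 * Intended use in each MPI process, before any other CUDA work
 * (cudaGetDeviceCount and cudaSetDevice are CUDA runtime API calls):
 *
 *     int count = 0;
 *     cudaGetDeviceCount(&count);
 *     cudaSetDevice(pick_device(local_rank, count));
 *
 * How local_rank is obtained is left open here; it is an assumption of
 * this sketch that the caller already knows it.
 */
```

With two GPUs and two processes per node, local ranks 0 and 1 map to devices 0 and 1. A single process always maps to device 0, which corresponds to the "just use one gpu" case Dave asks about.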