[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs

Barry Smith Sun, 2 Oct 2011 21:43:28 -0500

  Dave,

   I cannot explain why it does not use MatMult_SeqAIJCusp(); it does for me.


    Have you updated to the latest cusp/thrust from the mercurial repositories?

    There is a difference: in your new 4.0 build you added
--download-txpetscgpu=yes. BTW, that doesn't work for me with the latest cusp
and thrust from the mercurial repositories. Can you try reconfiguring and
making without that?


    Barry


On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:

> Barry Smith writes:
>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
>> 
>>> Thanks for the update.  I don't believe I have gotten a run with good
>>> performance yet, either from C or Fortran.  I wish there was an easy way for
>>> me to force use of only one of my gpus.  I don't want to have to pull one of
>>> the gpus in order to see if that is complicating things with Cuda 4.0.  I'll
>>> be eager to hear if you make any progress on figuring things out.
>>> 
>>> Do you understand yet why the petsc ex2.c example is able to parse the
>>> "-cuda_show_devices" argument but ex2f.F does not?
>> 
>> Matt put the code in the wrong place in PETSc, that is all, no big
>> existentialist reason. I will fix that.
> 
> Thanks.  I'll look forward to testing out the new version.
> 
> Dave
> 
>> Barry
>> 
>>> 
>>> Thanks,
>>> 
>>> Dave
>>> 
>>> Barry Smith writes:
>>>> It is not doing the MatMult operation on the GPU and hence needs to move
>>>> the vectors back and forth for each operation (since MatMult is done on
>>>> the CPU with the vector while vector operations are done on the GPU) hence
>>>> the terrible performance.
>>>> 
>>>> Not sure why yet. It is copying the Mat down for me from C.
>>>> 
>>>> Barry
>>>> 
>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
>>>> 
>>>>> In case it might be useful, I have attached two log files of runs with the
>>>>> ex2f petsc example from src/ksp/ksp/examples/tutorials.  One was run back
>>>>> in April with petsc-dev linked to Cuda 3.2.  It shows excellent runtime
>>>>> performance.  The other was run today with petsc-dev checked out of the
>>>>> mercurial repo yesterday morning and linked to Cuda 4.0.  In addition to
>>>>> the differences in run time performance, I also do not see an entry for
>>>>> MatCUSPCopyTo in the profiling section.  I'm not sure what the
>>>>> significance of that is.  I do observe that the run time for PCApply is
>>>>> about the same for the two cases.  I think I would expect that to be the
>>>>> case even if the problem were partitioned across two gpus.  However, it
>>>>> does make me wonder if the absence of MatCUSPCopyTo in the profiling
>>>>> section of the Cuda 4.0 log file is an indication that the matrix was not
>>>>> actually copied to the gpu.  I'm not sure yet how to check for that.  Hope
>>>>> this might be useful.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Dave
>>>>> 
>>>>> 
>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
>>>>> Dave Nystrom writes:
>>>>>> Matthew Knepley writes:
>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
>>>>>>>> Barry Smith writes:
>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
>>>>>>>>>> Hi Barry,
>>>>>>>>>> 
>>>>>>>>>> I've sent a couple more emails on this topic.  What I am trying to
>>>>>>>>>> do at the moment is to figure out how to have a problem run on only
>>>>>>>>>> one gpu if it will fit in the memory of that gpu.  Back in April
>>>>>>>>>> when I had built petsc-dev with Cuda 3.2, petsc would only use one
>>>>>>>>>> gpu if you had multiple gpus on your machine.  In order to use
>>>>>>>>>> multiple gpus for a problem, one had to use multiple threads with a
>>>>>>>>>> separate thread assigned to control each gpu.  But Cuda 4.0 has, I
>>>>>>>>>> believe, made that transparent and under the hood.  So now when I
>>>>>>>>>> run a small example problem such as
>>>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem, it
>>>>>>>>>> gets partitioned to run on both of the gpus in my machine.  The
>>>>>>>>>> result is a very large performance hit because of communication
>>>>>>>>>> back and forth from one gpu to the other via the cpu.
>>>>>>>>> 
>>>>>>>>> How do you know there is lots of communication from the GPU to the
>>>>>>>>> CPU? In the -log_summary? Nope because PETSc does not manage
>>>>>>>>> anything like that (that is one CPU process using both GPUs).
>>>>>>>> 
>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by petsc.
>>>>>>>> 
>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower now than it
>>>>>>>>>> did with Cuda 3.2.  I believe if one is programming down at the
>>>>>>>>>> cuda level, it is possible to have a smaller problem run on only
>>>>>>>>>> one gpu so that there is communication only between the cpu and gpu
>>>>>>>>>> and only at the start and end of the calculation.
>>>>>>>>>> 
>>>>>>>>>> To me, it seems like what is needed is a petsc option to specify
>>>>>>>>>> the number of gpus to run on that can somehow get passed down to
>>>>>>>>>> the cuda level through cusp and thrust.  I fear that the short term
>>>>>>>>>> solution is going to have to be for me to pull one of the gpus out
>>>>>>>>>> of my desktop system but it would be nice if there was a way to
>>>>>>>>>> tell petsc and friends to just use one gpu when I want it to.
>>>>>>>>>> 
>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate what
>>>>>>>>>> I am trying to describe regarding the performance hit.
>>>>>>>>> 
>>>>>>>>> I am not convinced that the poor performance you are getting now has
>>>>>>>>> anything to do with using both GPUs. Please run a PETSc program with
>>>>>>>>> the command -cuda_show_devices
>>>>>>>> 
>>>>>>>> I ran the following command:
>>>>>>>> 
>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary \
>>>>>>>>      -cuda_show_devices -mat_type aijcusp -vec_type cusp -options_left
>>>>>>>> 
>>>>>>>> The result was a report that there was one option left, that being
>>>>>>>> -cuda_show_devices.  I am using a copy of petsc-dev that I cloned and
>>>>>>>> built this morning.
>>>>>>> 
>>>>>>> What do you have at src/sys/object/pinit.c:825? You should see the code
>>>>>>> that processes this option. You should be able to break there in the
>>>>>>> debugger and see what happens. This sounds again like you are not
>>>>>>> processing options correctly.
>>>>>> 
>>>>>> Hi Matt,
>>>>>> 
>>>>>> I'll take a look at that in a bit and see if I can figure out what is
>>>>>> going on.  I do see the code that you mention that processes the
>>>>>> arguments that Barry mentioned.  In terms of processing options
>>>>>> correctly, at least in this case I am actually running one of the petsc
>>>>>> examples rather than my own code.  And it seems to correctly process the
>>>>>> other command line arguments.  Anyway, I'll write more after I have had
>>>>>> a chance to investigate more.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Dave
>>>>>> 
>>>>>>> Matt
>>>>>>> 
>>>>>>>>> What are the choices?  You can then pick one of them and run with
>>>>>>>>> -cuda_set_device integer
>>>>>>>> 
>>>>>>>> The -cuda_set_device option does not appear to be recognized either,
>>>>>>>> even if I choose an integer like 0.
>>>>>>>> 
>>>>>>>>> Does this change things?
>>>>>>>> 
>>>>>>>> I suspect it would change things if I could get it to work.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Dave
>>>>>>>> 
>>>>>>>>> Barry
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> 
>>>>>>>>>> Dave
>>>>>>>>>> 
>>>>>>>>>> Barry Smith writes:
>>>>>>>>>>> Dave,
>>>>>>>>>>> 
>>>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU
>>>>>>>>>>> process to use two GPUs at the same time. However you could have
>>>>>>>>>>> two MPI processes each using their own GPU.
>>>>>>>>>>> 
>>>>>>>>>>> The one tricky part is you need to make sure each MPI process uses
>>>>>>>>>>> a different GPU. We currently do not have a mechanism to do this
>>>>>>>>>>> assignment automatically. I think it can be done with
>>>>>>>>>>> cudaSetDevice(). But I don't know the details, sending this to
>>>>>>>>>>> petsc-dev at mcs.anl.gov where more people may know.
>>>>>>>>>>> 
>>>>>>>>>>> PETSc-folks,
>>>>>>>>>>> 
>>>>>>>>>>> We need a way to have this setup automatically.
>>>>>>>>>>> 
>>>>>>>>>>> Barry
>>>>>>>>>>> 
>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.  This is
>>>>>>>>>>>> a desktop machine with a single processor.  I know that Cuda 4.0
>>>>>>>>>>>> has support for running on multiple gpus but don't know if petsc
>>>>>>>>>>>> uses that.  But suppose I have a problem that will fit in the
>>>>>>>>>>>> memory for a single gpu.  Will petsc run the problem on a single
>>>>>>>>>>>> gpu or does it split it between the 2 gpus and incur the
>>>>>>>>>>>> communication overhead of copying data between the two gpus?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Dave
>>>>>>> 
>>>>>>> -- 
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>> their experiments lead.
>>>>>>> -- Norbert Wiener
>>>> 
>>
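
For reference, the per-process GPU assignment discussed in the thread (each
MPI process calling cudaSetDevice() with a different device) could be sketched
roughly as below. This is only an illustration under stated assumptions: the
helper name pick_device_for_rank and the round-robin policy are hypothetical
and are not PETSc code. cudaGetDeviceCount() and cudaSetDevice() are real CUDA
runtime calls, but they appear only in a comment so the snippet compiles
without CUDA installed.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of PETSc): map a process's node-local rank
 * to a CUDA device index by round-robin over the devices on the node.
 * Returns -1 when no device is available.
 */
int pick_device_for_rank(int local_rank, int num_devices)
{
    assert(local_rank >= 0);            /* ranks are never negative */
    if (num_devices <= 0) {
        return -1;                      /* no usable GPU on this node */
    }
    return local_rank % num_devices;    /* rank 0 -> dev 0, rank 1 -> dev 1, ... */
}

/*
 * Intended use inside each MPI process (assumes the CUDA runtime is linked
 * and that local_rank is this process's rank on its node):
 *
 *     int ndev = 0;
 *     cudaGetDeviceCount(&ndev);
 *     int dev = pick_device_for_rank(local_rank, ndev);
 *     if (dev >= 0) cudaSetDevice(dev);
 */
```

With two GPUs and two MPI processes on the node, this policy gives each
process its own device; with a single process it always selects device 0.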
