Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-20 Thread Barry Smith


> On Jan 20, 2022, at 10:40 PM, Junchao Zhang  wrote:
> 
> *  Email notification when one is mentioned or added as a reviewer

   Hmm, I get emails on these? I don't get an email saying I am a code owner for an MR.

> *  Color text in comment box
> *  Click a failed job, run the job with the updated branch
> *  Allow one to reorder commits (e.g., the fix up commits generated from 
> applying comments) and mark commits that should be fixed up
> *  Easily retarget a branch, e.g., from main to release (currently I have to 
> checkout to local machine, do rebase, then push)   
> 
> --Junchao Zhang
> 
> 
> On Thu, Jan 20, 2022 at 7:05 PM Barry Smith wrote:
> 
>   I got asked to go over some of my Gitlab workflow uses next week with some 
> Gitlab developers; they do this to understand how Gitlab is used, how it can 
> be improved etc. 
> 
>   If anyone has ideas on topics I should hit, let me know. I will hit them on 
> the brokenness of appropriate code-owners not being automatically added to 
> reviewers. And support for people outside of the PETSc group to set more 
> things when they make MRs. And being able to easily add non-PETSc folks as 
> reviewers.
> 
>   Barry
> 



Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-20 Thread Junchao Zhang
*  Email notification when one is mentioned or added as a reviewer
*  Color text in comment box
*  Click a failed job, run the job with the *updated* branch
*  Allow one to reorder commits (e.g., the fix up commits generated from
applying comments) and mark commits that should be fixed up
*  Easily retarget a branch, e.g., from main to release (currently I have
to checkout to local machine, do rebase, then push)

--Junchao Zhang


On Thu, Jan 20, 2022 at 7:05 PM Barry Smith  wrote:

>
>   I got asked to go over some of my Gitlab workflow uses next week with
> some Gitlab developers; they do this to understand how Gitlab is used, how
> it can be improved etc.
>
>   If anyone has ideas on topics I should hit, let me know. I will hit them
> on the brokenness of appropriate code-owners not being automatically added
> to reviewers. And support for people outside of the PETSc group to set more
> things when they make MRs. And being able to easily add non-PETSc folks as
> reviewers.
>
>   Barry
>
>


Re: [petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-20 Thread Matthew Knepley
Hashes should work in the search box. Also, the searching in MRs is not
that accurate.

   Matt

On Thu, Jan 20, 2022 at 8:05 PM Barry Smith  wrote:

>
>   I got asked to go over some of my Gitlab workflow uses next week with
> some Gitlab developers; they do this to understand how Gitlab is used, how
> it can be improved etc.
>
>   If anyone has ideas on topics I should hit, let me know. I will hit them
> on the brokenness of appropriate code-owners not being automatically added
> to reviewers. And support for people outside of the PETSc group to set more
> things when they make MRs. And being able to easily add non-PETSc folks as
> reviewers.
>
>   Barry
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


[petsc-dev] Gitlab workflow discussion with GitLab developers

2022-01-20 Thread Barry Smith


  I got asked to go over some of my Gitlab workflow uses next week with some 
Gitlab developers; they do this to understand how Gitlab is used, how it can be 
improved etc. 

  If anyone has ideas on topics I should hit, let me know. I will hit them on 
the brokenness of appropriate code-owners not being automatically added to 
reviewers. And support for people outside of the PETSc group to set more things 
when they make MRs. And being able to easily add non-PETSc folks as reviewers.

  Barry



Re: [petsc-dev] Using PETSC with GPUs

2022-01-20 Thread Barry Smith


> On Jan 20, 2022, at 5:24 PM, Rohan Yadav  wrote:
> 
> Thanks Barry, this is what I was looking for. However, it doesn't seem to be 
> working for me (the reported times are significantly different still with 
> -log_view on and off).

   I think this is because, without -log_view, your second loop overlaps the 
additional kernel launches with the GPU computations, but with -log_view it 
does not (since -log_view forces each MatMult to end before the CPU can launch 
the next one). If you put the PetscLogGpuTimeBegin/End within the loop then 
-log_view should have much less effect. But I am not sure exactly what will 
happen with them inside the loop together with -log_view, since there will be 
"extra" PetscLogGpuTimeEnd synchronization points; I don't think they will 
matter but I cannot say for sure. Like I said, tricky.




> Here is my exact timing code:
> ```
> double avgTime = 0.0;
> {
>   PetscLogDouble start, end;
>   PetscLogGpuTimeBegin();
>   for (int i = 0; i < warmup; i++) {
>     MatMult(A, x, y);
>   }
>   PetscLogGpuTimeEnd();
>   PetscLogGpuTimeBegin();
>   PetscTime(&start);
>   for (int i = 0; i < niter; i++) {
>     MatMult(A, x, y);
>   }
>   PetscLogGpuTimeEnd();
>   PetscTime(&end);
>   auto sec = end - start;
>   avgTime = double(sec) / double(niter);
> }
> ```
> I'm measuring the time for a group of MatMult's as you suggested (with some 
> warmup iterations).
> 
> Rohan
> 
> On Thu, Jan 20, 2022 at 1:42 PM Barry Smith  > wrote:
> 
>    Some operations on the GPU are asynchronous: the CPU passes the kernel 
> launch to the GPU and then immediately returns, ready to do something else 
> before the kernel is completed (or even started). Others, like VecDot(), where 
> the result is stored in CPU memory, have to block until the kernel is 
> complete and the result is copied up to the CPU.
> 
>   -log_view forces the calls to PetscLogGpuTimeEnd(), which has (for CUDA) 
> 
> cerr = cudaEventRecord(petsc_gputimer_end,PetscDefaultCudaStream);CHKERRCUDA(cerr);
> cerr = cudaEventSynchronize(petsc_gputimer_end);CHKERRCUDA(cerr);
> cerr = cudaEventElapsedTime(&gtime,petsc_gputimer_begin,petsc_gputimer_end);CHKERRCUDA(cerr);
> petsc_gtime += (PetscLogDouble)gtime/1000.0; /* convert milliseconds to seconds */
> 
> which essentially causes the CPU to wait until the kernel is complete, hence 
> your time with -log_view captures the full time to run the kernel.
> 
> So timing with GPUs can be a tricky business (when do you want to block and 
> when do you not?) For your loop, you may want to use
> 
> PetscLogGpuTimeBegin()
>> start = now()
>> 
>> for (int i = 0; i < 10; i++) {
>> MatMult(A, x, y);
>> }
> PetscLogGpuTimeEnd()
>> end = now()
>> print((end - start) / 10)
> 
> 
> Now after the loop it will wait until all the multiplies are completely done, 
> giving a better view of the time it takes. If you did
> 
> 
>> start = now()
>> 
>> for (int i = 0; i < 10; i++) {
> PetscLogGpuTimeBegin()
>> MatMult(A, x, y);
> PetscLogGpuTimeEnd()
>> }
>> end = now()
>> print((end - start) / 10)
> 
> You would wait a longer time because the CPU could not tell the GPU about the 
> second kernel launch until the first kernel is completely done. Hence there 
> would be no overlap of GPU computation and CPU kernel launches (which take a 
> long time). 
> 
> IMHO timing individual operations like a single MatMult() on GPUs is of 
> limited usefulness, since you slow down the computation (by removing the 
> asynchronous overlap between the GPU and CPU) in order to get accurate 
> times. It is better to time something like a complete linear solve, nonlinear 
> solve, etc., and not log at a finer granularity.
> 
> Barry
> 
> 
> 
> 
> 
>> On Jan 20, 2022, at 4:07 PM, Rohan Yadav wrote:
>> 
>> Another small question -- I'm a little confused around timing GPU codes with 
>> PETSc. I have a code that looks like:
>> ```
>> start = now()
>> for (int i = 0; i < 10; i++) {
>> MatMult(A, x, y);
>> }
>> end = now()
>> print((end - start) / 10)
>> ```
>> 
>> If I run this program with `-vec_type cuda -mat_type aijcusparse`, the GPUs 
>> are indeed utilized, but the recorded time is very tiny (I imagine just 
>> tracking the cost of launching cuda kernels). However, if I add `-log_view` 
>> to the command line arguments, then the resulting time printed matches what 
>> is recorded by `nvprof`. What is the correct way to benchmark PETSc with 
>> GPUs without having -log_view turned on?
>> 
>> Thanks,
>> 
>> Rohan
>> 
>> On Sat, Jan 15, 2022 at 7:37 AM Barry Smith wrote:
>> 
>>   Oh yes, you are correct for this operation since the handling of different 
>> nonzero pattern is not trivial to implement well for the GPU.
>> 
>>> On Jan 15, 2022, at 1:17 AM, Rohan Yadav wrote:
>>> 
>>> Scanning the source code for mpiseqaijcusparse confirms my 

Re: [petsc-dev] Using PETSC with GPUs

2022-01-20 Thread Rohan Yadav
Thanks Barry, this is what I was looking for. However, it doesn't seem to
be working for me (the reported times are significantly different still
with -log_view on and off). Here is my exact timing code:
```
double avgTime = 0.0;
{
  PetscLogDouble start, end;
  PetscLogGpuTimeBegin();
  for (int i = 0; i < warmup; i++) {
    MatMult(A, x, y);
  }
  PetscLogGpuTimeEnd();
  PetscLogGpuTimeBegin();
  PetscTime(&start);
  for (int i = 0; i < niter; i++) {
    MatMult(A, x, y);
  }
  PetscLogGpuTimeEnd();
  PetscTime(&end);
  auto sec = end - start;
  avgTime = double(sec) / double(niter);
}
```
I'm measuring the time for a group of MatMult's as you suggested (with some
warmup iterations).

Rohan

On Thu, Jan 20, 2022 at 1:42 PM Barry Smith  wrote:

>
>    Some operations on the GPU are asynchronous: the CPU passes the kernel
> launch to the GPU and then immediately returns, ready to do something else
> before the kernel is completed (or even started). Others, like VecDot(), where
> the result is stored in CPU memory, have to block until the kernel is
> complete and the result is copied up to the CPU.
>
>   -log_view forces the calls to PetscLogGpuTimeEnd(), which has (for
> CUDA)
>
> cerr =
> cudaEventRecord(petsc_gputimer_end,PetscDefaultCudaStream);CHKERRCUDA(cerr);
> cerr = cudaEventSynchronize(petsc_gputimer_end);CHKERRCUDA(cerr);
> cerr =
> cudaEventElapsedTime(&gtime,petsc_gputimer_begin,petsc_gputimer_end);CHKERRCUDA(cerr);
> petsc_gtime += (PetscLogDouble)gtime/1000.0; /* convert milliseconds to
> seconds */
>
> which essentially causes the CPU to wait until the kernel is complete,
> hence your time with -log_view captures the full time to run the kernel.
>
> So timing with GPUs can be a tricky business (when do you want to block
> and when do you not?) For your loop, you may want to use
>
> PetscLogGpuTimeBegin()
>
> start = now()
>
>
> for (int i = 0; i < 10; i++) {
> MatMult(A, x, y);
> }
>
> PetscLogGpuTimeEnd()
>
> end = now()
> print((end - start) / 10)
>
>
> Now after the loop it will wait until all the multiplies are completely
> done, giving a better view of the time it takes. If you did
>
>
> start = now()
>
>
> for (int i = 0; i < 10; i++) {
>
> PetscLogGpuTimeBegin()
>
> MatMult(A, x, y);
>
> PetscLogGpuTimeEnd()
>
> }
>
> end = now()
> print((end - start) / 10)
>
>
> You would wait a longer time because the CPU could not tell the GPU about
> the second kernel launch until the first kernel is completely done. Hence
> there would be no overlap of GPU computation and CPU kernel launches (which
> take a long time).
>
> IMHO timing individual operations like a single MatMult() on GPUs is of
> limited usefulness, since you slow down the computation (by removing the
> asynchronous overlap between the GPU and CPU) in order to get accurate
> times. It is better to time something like a complete linear solve, nonlinear
> solve, etc., and not log at a finer granularity.
>
> Barry
>
>
>
>
>
> On Jan 20, 2022, at 4:07 PM, Rohan Yadav  wrote:
>
> Another small question -- I'm a little confused around timing GPU codes
> with PETSc. I have a code that looks like:
> ```
> start = now()
> for (int i = 0; i < 10; i++) {
> MatMult(A, x, y);
> }
> end = now()
> print((end - start) / 10)
> ```
>
> If I run this program with `-vec_type cuda -mat_type aijcusparse`, the
> GPUs are indeed utilized, but the recorded time is very tiny (I imagine
> just tracking the cost of launching cuda kernels). However, if I add
> `-log_view` to the command line arguments, then the resulting time printed
> matches what is recorded by `nvprof`. What is the correct way to benchmark
> PETSc with GPUs without having -log_view turned on?
>
> Thanks,
>
> Rohan
>
> On Sat, Jan 15, 2022 at 7:37 AM Barry Smith  wrote:
>
>>
>>   Oh yes, you are correct for this operation since the handling of
>> different nonzero pattern is not trivial to implement well for the GPU.
>>
>> On Jan 15, 2022, at 1:17 AM, Rohan Yadav  wrote:
>>
>> Scanning the source code for mpiseqaijcusparse confirms my thoughts --
>> when used with DIFFERENT_NONZERO_PATTERN, it falls back to calling
>> MatAXPY_SeqAIJ, copying the data back over to the host.
>>
>> Rohan
>>
>> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav 
>> wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Rohan Yadav 
>>> Date: Fri, Jan 14, 2022 at 10:03 PM
>>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>>> To: Barry Smith 
>>>
>>>
>>> Ok, I'll try looking with greps like that and see what I find.
>>>
>>> >  My guess why your code is not using the seqaijcusparse is that you
>>> are not setting the type before you call MatLoad() hence it loads with
>>> SeqAIJ. -mat_type does not magically change a type once a matrix has a set
>>> type. I agree our documentation on how to make objects be GPU objects is
>>> horrible now.
>>>
>>> I printed out my matrices with the PetscViewer objects and can confirm
>>> that the type is seqaijcusparse. Perhaps for the 

Re: [petsc-dev] Using PETSC with GPUs

2022-01-20 Thread Barry Smith

   Some operations on the GPU are asynchronous: the CPU passes the kernel 
launch to the GPU and then immediately returns, ready to do something else 
before the kernel is completed (or even started). Others, like VecDot(), where 
the result is stored in CPU memory, have to block until the kernel is complete 
and the result is copied up to the CPU.

  -log_view forces the calls to PetscLogGpuTimeEnd(), which has (for CUDA) 

cerr = cudaEventRecord(petsc_gputimer_end,PetscDefaultCudaStream);CHKERRCUDA(cerr);
cerr = cudaEventSynchronize(petsc_gputimer_end);CHKERRCUDA(cerr);
cerr = cudaEventElapsedTime(&gtime,petsc_gputimer_begin,petsc_gputimer_end);CHKERRCUDA(cerr);
petsc_gtime += (PetscLogDouble)gtime/1000.0; /* convert milliseconds to seconds */

which essentially causes the CPU to wait until the kernel is complete, hence 
your time with -log_view captures the full time to run the kernel.

So timing with GPUs can be a tricky business (when do you want to block and 
when do you not?) For your loop, you may want to use

PetscLogGpuTimeBegin()
> start = now()
> 
> for (int i = 0; i < 10; i++) {
> MatMult(A, x, y);
> }
PetscLogGpuTimeEnd()
> end = now()
> print((end - start) / 10)


Now after the loop it will wait until all the multiplies are completely done, 
giving a better view of the time it takes. If you did


> start = now()
> 
> for (int i = 0; i < 10; i++) {
PetscLogGpuTimeBegin()
> MatMult(A, x, y);
PetscLogGpuTimeEnd()
> }
> end = now()
> print((end - start) / 10)

You would wait a longer time because the CPU could not tell the GPU about the 
second kernel launch until the first kernel is completely done. Hence there 
would be no overlap of GPU computation and CPU kernel launches (which take a 
long time). 

IMHO timing individual operations like a single MatMult() on GPUs is of 
limited usefulness, since you slow down the computation (by removing the 
asynchronous overlap between the GPU and CPU) in order to get accurate times. 
It is better to time something like a complete linear solve, nonlinear solve, 
etc., and not log at a finer granularity.

Barry





> On Jan 20, 2022, at 4:07 PM, Rohan Yadav  wrote:
> 
> Another small question -- I'm a little confused around timing GPU codes with 
> PETSc. I have a code that looks like:
> ```
> start = now()
> for (int i = 0; i < 10; i++) {
> MatMult(A, x, y);
> }
> end = now()
> print((end - start) / 10)
> ```
> 
> If I run this program with `-vec_type cuda -mat_type aijcusparse`, the GPUs 
> are indeed utilized, but the recorded time is very tiny (I imagine just 
> tracking the cost of launching cuda kernels). However, if I add `-log_view` 
> to the command line arguments, then the resulting time printed matches what 
> is recorded by `nvprof`. What is the correct way to benchmark PETSc with GPUs 
> without having -log_view turned on?
> 
> Thanks,
> 
> Rohan
> 
> On Sat, Jan 15, 2022 at 7:37 AM Barry Smith wrote:
> 
>   Oh yes, you are correct for this operation since the handling of different 
> nonzero pattern is not trivial to implement well for the GPU.
> 
>> On Jan 15, 2022, at 1:17 AM, Rohan Yadav wrote:
>> 
>> Scanning the source code for mpiseqaijcusparse confirms my thoughts -- when 
>> used with DIFFERENT_NONZERO_PATTERN, it falls back to calling 
>> MatAXPY_SeqAIJ, copying the data back over to the host.
>> 
>> Rohan
>> 
>> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav wrote:
>> 
>> 
>> -- Forwarded message -
>> From: Rohan Yadav <roh...@alumni.cmu.edu>
>> Date: Fri, Jan 14, 2022 at 10:03 PM
>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>> To: Barry Smith <bsm...@petsc.dev>
>> 
>> 
>> Ok, I'll try looking with greps like that and see what I find.
>> 
>> >  My guess why your code is not using the seqaijcusparse is that you are 
>> > not setting the type before you call MatLoad() hence it loads with SeqAIJ. 
>> > -mat_type does not magically change a type once a matrix has a set type. I 
>> > agree our documentation on how to make objects be GPU objects is horrible 
>> > now.
>> 
>> I printed out my matrices with the PetscViewer objects and can confirm that 
>> the type is seqaijcusparse. Perhaps for the way I'm using it 
>> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to 
>> get any more diagnostic info about why the cuda kernel isn't called...
>> 
>> Rohan
>> 
>> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith wrote:
>> 
>>   This changes rapidly and depends on if the backend is CUDA, HIP, Sycl, or 
>> Kokkos. The only way to find out definitively is with, for example, 
>> 
>> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>> 
>> 
>>   Because of our unfortunate earlier naming choices you need to kind of 
>> know what to grep for; for CUDA it may be cuSparse or cuBLAS
>> 
>>   Not yet merged branches may also have some operations that are 

Re: [petsc-dev] Using PETSC with GPUs

2022-01-20 Thread Rohan Yadav
Another small question -- I'm a little confused around timing GPU codes
with PETSc. I have a code that looks like:
```
start = now()
for (int i = 0; i < 10; i++) {
MatMult(A, x, y);
}
end = now()
print((end - start) / 10)
```

If I run this program with `-vec_type cuda -mat_type aijcusparse`, the GPUs
are indeed utilized, but the recorded time is very tiny (I imagine just
tracking the cost of launching cuda kernels). However, if I add `-log_view`
to the command line arguments, then the resulting time printed matches what
is recorded by `nvprof`. What is the correct way to benchmark PETSc with
GPUs without having -log_view turned on?

Thanks,

Rohan

On Sat, Jan 15, 2022 at 7:37 AM Barry Smith  wrote:

>
>   Oh yes, you are correct for this operation since the handling of
> different nonzero pattern is not trivial to implement well for the GPU.
>
> On Jan 15, 2022, at 1:17 AM, Rohan Yadav  wrote:
>
> Scanning the source code for mpiseqaijcusparse confirms my thoughts --
> when used with DIFFERENT_NONZERO_PATTERN, it falls back to calling
> MatAXPY_SeqAIJ, copying the data back over to the host.
>
> Rohan
>
> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav 
> wrote:
>
>>
>>
>> -- Forwarded message -
>> From: Rohan Yadav 
>> Date: Fri, Jan 14, 2022 at 10:03 PM
>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>> To: Barry Smith 
>>
>>
>> Ok, I'll try looking with greps like that and see what I find.
>>
>> >  My guess why your code is not using the seqaijcusparse is that you are
>> not setting the type before you call MatLoad() hence it loads with SeqAIJ.
>> -mat_type does not magically change a type once a matrix has a set type. I
>> agree our documentation on how to make objects be GPU objects is horrible
>> now.
>>
>> I printed out my matrices with the PetscViewer objects and can confirm
>> that the type is seqaijcusparse. Perhaps for the way I'm using it
>> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to
>> get any more diagnostic info about why the cuda kernel isn't called...
>>
>> Rohan
>>
>> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith  wrote:
>>
>>>
>>>   This changes rapidly and depends on if the backend is CUDA, HIP, Sycl,
>>> or Kokkos. The only way to find out definitively is with, for example,
>>>
>>> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>>>
>>>
>>>   Because of our unfortunate earlier naming choices you need to kind
>>> of know what to grep for; for CUDA it may be cuSparse or cuBLAS
>>>
>>>   Not yet merged branches may also have some operations that are still
>>> being developed.
>>>
>>>   My guess why your code is not using the seqaijcusparse is that you are
>>> not setting the type before you call MatLoad() hence it loads with SeqAIJ.
>>> -mat_type does not magically change a type once a matrix has a set type. I
>>> agree our documentation on how to make objects be GPU objects is horrible
>>> now.
>>>
>>>   Barry
>>>
>>>
>>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav  wrote:
>>>
>>> I was wondering if there is a definitive list for what operations are
>>> and aren't supported for distributed GPU execution. For some operations,
>>> like `MatMult`, it is clear that MPIAIJCUSPARSE implements MatMult from the
>>> documentation, but other operations it is unclear, such as MatMatMult.
>>> Another scenario is the MatAXPY kernel, which supposedly has a
>>> SeqAIJCUSPARSE implementation, which I take means that it can only execute
>>> on a single GPU. However, even if I pass -mat_type seqaijcusparse to the
>>> kernel it doesn't seem to utilize the GPU.
>>>
>>> Rohan
>>>
>>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith  wrote:
>>>

   Just use 1 MPI rank.


 
Event          Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
              Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
--------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided      1 1.0 1.8650e-01 3467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
MatMult           30 1.0 6.6642e+01    1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65 100 91 93  2  65 100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0

 From this it is clear the matrix never ended up on the GPU, but the
 vector did. For each multiply, it is copying the vector from the GPU to the
 CPU and then doing the MatMult on the CPU. If the MatMult was done on the
 GPU the final number in the row would