> The memory overhead (for both CPU and GPU) of PyTorch is getting worse and
> worse as it evolves. A conjecture is that the CUDA kernels in the library are
> responsible for this. But the overhead for TensorFlow 2 is just around 300MB
> (compared to 1.5GB for PyTorch).
I read through the thread
Here is an interesting thread discussing the memory issue for PyTorch (which I
think is also relevant to PETSc):
https://github.com/pytorch/pytorch/issues/12873
The memory overhead (for both CPU and GPU) of PyTorch is getting worse and
worse as it evolves. A conjecture is that the CUDA kernels
cuda-memcheck is a valgrind clone, but like valgrind it does not report
usage as it goes, only in a report at the end.
On Fri, Jan 7, 2022 at 10:23 PM Barry Smith wrote:
>
> Doesn't NVIDIA supply a valgrind-like tool that allows tracking
> memory usage? I'm pretty sure I've seen one; it
Doesn't NVIDIA supply a valgrind-like tool that allows tracking memory
usage? I'm pretty sure I've seen one; it should be able to show memory usage as
a function of time so you can see where the memory is being allocated.
Barry
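Short of a dedicated profiler, memory usage as a function of time can be approximated by polling a memory reading while the workload runs. The following is a minimal Python sketch under stated assumptions: `sample_while_running`, `work`, and `probe` are hypothetical names introduced here, and `probe` stands for any callable that returns the current memory reading (for example, one that queries the driver from outside the process).

```python
import threading
import time

def sample_while_running(work, probe, interval=0.05):
    """Run work() in a background thread and record (elapsed_seconds, probe())
    pairs roughly every `interval` seconds until work() finishes."""
    samples = []
    start = time.perf_counter()
    worker = threading.Thread(target=work)
    worker.start()
    while worker.is_alive():
        samples.append((time.perf_counter() - start, probe()))
        # Wait for either the next sampling tick or the end of the work.
        worker.join(timeout=interval)
    # One final reading after the work has completed.
    samples.append((time.perf_counter() - start, probe()))
    return samples
```

The resulting list can be printed or plotted to see when the reading jumps. This only shows when memory grows, not which call allocated it.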
> On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch
> it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists
> across the entire running time of an application. cupm_initialize contributes
> 0.36GB out of 0.73GB.
If I had to guess this may be the latent overhead of CUDA streams and events,
but even then 360 MB seems
Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes
0.73GB CUDA memory and this overhead persists across the entire running time of
an application. cupm_initialize contributes 0.36GB out of 0.73GB. It is still
unclear what takes the remaining 0.37GB.
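The accounting above can be written out as a small Python sketch. The three input figures are the ones reported in this message; the function name `breakdown` and the dictionary labels are hypothetical, introduced only for illustration.

```python
# Figures reported in the thread, in GB of CUDA memory.
TORCH_IMPORT_GB = 1.2
PETSC_TOTAL_GB = 0.73
CUPM_INITIALIZE_GB = 0.36

def breakdown(torch_gb, petsc_gb, cupm_init_gb):
    """Split the observed overhead into its known and unexplained parts."""
    return {
        "torch import": torch_gb,
        "cupm_initialize": cupm_init_gb,
        "PETSc, unexplained": petsc_gb - cupm_init_gb,
        "total": torch_gb + petsc_gb,
    }

report = breakdown(TORCH_IMPORT_GB, PETSC_TOTAL_GB, CUPM_INITIALIZE_GB)
for name, gb in report.items():
    print(f"{name:>20}: {gb:.2f} GB")
```

The total of roughly 1.9GB is consistent with the "about 2GB" figure in the original report.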
The torch issue is
1. Commenting out
   ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
   in device/impls/cupm/cupmcontext.hpp:L199
   CUDA memory: 1.575GB
   CUDA memory without importing torch: 0.370GB
   This has the same effect as commenting out L437-L440 in interface/device.cxx
2. Comment out these
> They had no influence on the memory usage.
???
Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 360 in
cupmdevice.cxx as well.
Best regards,
Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)
>
I have tried all of these. They had no influence on the memory usage.
On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch wrote:
Initializing cutlass and cusolver does not affect the memory usage. I did the
following to turn them off:
Ok next things to try out in
> Initializing cutlass and cusolver does not affect the memory usage. I did the
> following to turn them off:
Ok next things to try out in order:
1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
Put a PetscFunctionReturn(0); right after this
2.
Are you sure the make dependencies are correct and the code got properly
recompiled with the lines commented out in the header file? It is difficult to
believe they use no memory.
What else in the PETSc initialization of the GPU would use huge chunks of
memory?
> On Jan 7, 2022, at 11:58
Initializing cutlass and cusolver does not affect the memory usage. I did the
following to turn them off:
diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
index 51fed809e4d..9a5f068323a 100644
---
> I don't think this is right. We want the device initialized by PETSc; we
> just don't want the cuBLAS and cuSOLVER stuff initialized, in order to see how
> much memory initializing the BLAS and solvers takes.
This is how it has always been, PetscDevice adopted the same initialization
I don't think this is right. We want the device initialized by PETSc; we
just don't want the cuBLAS and cuSOLVER stuff initialized, in order to see how
much memory initializing the BLAS and solvers takes.
So I think you need to comment things in cupminterface.hpp like cublasCreate
and
Commenting out the block containing PetscDeviceContextXXX reduces the memory
cost from 1.9GB to 1.5GB.
Commenting out PetscDeviceInitializeTypeFromOptions_Private() reduces it to
0GB.
diff --git a/src/sys/objects/device/interface/device.cxx b/src/sys/objects/device/interface/device.cxx
index
Hit send too early…
If you don't want to comment anything out, you can also run with the
"-device_enable lazy" option. Normally this is the default behavior, but if
-log_view or -log_summary is provided the default becomes "-device_enable eager".
See src/sys/objects/device/interface/device.cxx:398
Best
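For a Python test script, the option could be passed programmatically rather than on the command line. This is a minimal sketch, assuming the script uses petsc4py (the thread does not say so explicitly); `petsc_argv` is a hypothetical helper, and only the two values named in this message are accepted.

```python
import sys

def petsc_argv(device_enable="lazy", extra=()):
    """Build an argument list carrying the -device_enable option.

    Only the values mentioned in the thread ("lazy", "eager") are accepted.
    """
    if device_enable not in ("lazy", "eager"):
        raise ValueError(f"unexpected -device_enable value: {device_enable!r}")
    # The first entry is the program name, as in sys.argv.
    return [sys.argv[0], "-device_enable", device_enable, *extra]

# Assumed usage with petsc4py (requires petsc4py to be installed):
#   import petsc4py
#   petsc4py.init(petsc_argv("lazy", ["-log_view"]))
#   from petsc4py import PETSc
```

Passing "lazy" explicitly alongside -log_view is intended to keep the lazy behavior that -log_view would otherwise switch to eager.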
> You need to go into the PetscInitialize() routine, find where it loads
> cuBLAS and cuSOLVER, comment out those lines, and then run with -log_view
Comment out
#if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) ||
PetscDefined(HAVE_SYCL))
ierr =
Without -log_view it does not load any cuBLAS/cuSOLVER immediately; with
-log_view it loads all that stuff at startup. You need to go into the
PetscInitialize() routine, find where it loads cuBLAS and cuSOLVER, comment out
those lines, and then run with -log_view
> On Jan 7, 2022, at 11:14 AM,
When PETSc is initialized, it takes about 2GB of CUDA memory. This is way too much
for doing nothing. A test script is attached to reproduce the issue. If I
remove the first line "import torch", PETSc consumes about 0.73GB, which is
still significant. Does anyone have any idea about this behavior?
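One way to put a number on this kind of overhead is to read the device's used memory before and after the initialization step. The sketch below is an illustration, not the attached test script: `parse_used_mib`, `gpu_used_mib`, and `overhead_mib` are hypothetical helpers, and the reading comes from the `nvidia-smi` command-line tool, which reports device-wide usage (so other processes on the same GPU will distort the result).

```python
import subprocess

def parse_used_mib(smi_output):
    """Parse one integer (MiB used) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]

def gpu_used_mib():
    """Query the used memory of every visible GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_used_mib(out)

def overhead_mib(action, probe=gpu_used_mib, device=0):
    """Return how much the reading for `device` grew while action() ran."""
    before = probe()[device]
    action()
    after = probe()[device]
    return after - before
```

Here `action` would be whatever triggers the initialization being measured, for example the import and initialization lines of the test script wrapped in a function.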