piiswrong closed pull request #11248: improve memory allocation error
URL: https://github.com/apache/incubator-mxnet/pull/11248
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/storage/cpu_device_storage.h b/src/storage/cpu_device_storage.h
index 43e98fe04a1..593cdd7ca3c 100644
--- a/src/storage/cpu_device_storage.h
+++ b/src/storage/cpu_device_storage.h
@@ -67,11 +67,20 @@ inline void* CPUDeviceStorage::Alloc(size_t size) {
   void* ptr;
 #if _MSC_VER
   ptr = _aligned_malloc(size, alignment_);
-  if (ptr == NULL) LOG(FATAL) << "Failed to allocate CPU Memory";
+  if (ptr == NULL)
 #else
   int ret = posix_memalign(&ptr, alignment_, size);
-  if (ret != 0) LOG(FATAL) << "Failed to allocate CPU Memory";
+  if (ret != 0)
 #endif
+    LOG(FATAL)
+      << "Failed to allocate CPU Memory. The most likely cause is "
+      << "out-of-memory. Things you can try to resolve this include:\n"
+      << "    1. Reduce batch size.\n"
+      << "    2. If you are using Gluon, you need to synchronize at each "
+      << "iteration. Otherwise you will keep allocating memory without "
+      << "waiting for compute to finish. You can use .asnumpy() to synchronize. "
+      << "Normally, it's loss.asnumpy(). You can also add mx.nd.waitall() after "
+      << "each iteration, but this may hurt performance.";
   return ptr;
 }
 
diff --git a/src/storage/cpu_shared_storage_manager.h b/src/storage/cpu_shared_storage_manager.h
index a52d779d231..38a0c20f214 100644
--- a/src/storage/cpu_shared_storage_manager.h
+++ b/src/storage/cpu_shared_storage_manager.h
@@ -175,8 +175,16 @@ void CPUSharedStorageManager::Alloc(Storage::Handle* handle) {
 
   if (fid == -1) {
     if (is_new) {
-      LOG(FATAL) << "Failed to open shared memory. shm_open failed with error "
-                 << strerror(errno);
+      if (errno == EMFILE) {
+        LOG(FATAL)
+          << "Failed to create shared memory because limit on the number of "
+          << "open file descriptors has been reached. You can increase the "
+          << "limit by typing 'ulimit -n' followed by a big number in terminal.";
+      } else {
+        LOG(FATAL)
+          << "Failed to create shared memory. shm_open failed with error "
+          << strerror(errno);
+      }
     } else {
       LOG(FATAL) << "Invalid file descriptor from shared array.";
     }
diff --git a/src/storage/gpu_device_storage.h b/src/storage/gpu_device_storage.h
index 435c7e81d2a..68768f35d63 100644
--- a/src/storage/gpu_device_storage.h
+++ b/src/storage/gpu_device_storage.h
@@ -62,7 +62,16 @@ inline void* GPUDeviceStorage::Alloc(size_t size) {
 #endif  // MXNET_USE_NCCL
   cudaError_t e = cudaMalloc(&ret, size);
   if (e != cudaSuccess && e != cudaErrorCudartUnloading)
-    LOG(FATAL) << "CUDA: " << cudaGetErrorString(e);
+    LOG(FATAL)
+      << "Failed to allocate GPU memory: " << cudaGetErrorString(e)
+      << ". The most likely cause is out-of-memory. "
+      << "Things you can try to resolve this include:\n"
+      << "    1. Reduce batch size.\n"
+      << "    2. If you are using Gluon, you need to synchronize at each "
+      << "iteration. Otherwise you will keep allocating memory without "
+      << "waiting for compute to finish. You can use .asnumpy() to synchronize. "
+      << "Normally, it's loss.asnumpy(). You can also add mx.nd.waitall() after "
+      << "each iteration, but this may hurt performance.";
 #else   // MXNET_USE_CUDA
   LOG(FATAL) << "Please compile with CUDA enabled";
 #endif  // MXNET_USE_CUDA


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to