jfrank94 opened a new issue #20363:
URL: https://github.com/apache/incubator-mxnet/issues/20363
I'm running into an error when running the DeepAREstimator from the GluonTS
library. I have installed all of the necessary packages, and when using the
mx.gpu() function, MXNet recognizes that the GPU (NVIDIA, CUDA 11.2) exists on
the system.
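A minimal sketch of that kind of check, assuming MXNet 1.x (these exact calls are my illustration, not copied from the report): querying the device only shows that MXNet can see the GPU, it does not yet launch any CUDA kernel on it.

```python
import mxnet as mx

# How many GPUs the MXNet runtime can see at all.
print("GPUs visible to MXNet:", mx.context.num_gpus())

# Query free/total memory on GPU 0; this talks to the driver but does not
# launch a compute kernel, so it can succeed even if the wheel was built
# without kernels for this GPU architecture.
print("GPU 0 memory (free, total):", mx.context.gpu_memory_info(0))

ctx = mx.gpu(0)  # the context later handed to the trainer
```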
Note that I'm able to run the code fine on Colab without having to run extra
commands such as installing "libquadmath0" or the NCCL library (version 2.8.4
for CUDA 11.2), but when running inside a Docker image those steps are needed,
and the error below still occurs there.
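For context, a rough sketch of the kind of call that fails. Only estimator.train(training_data=..., validation_data=...) and the train1_output name are taken from the traceback below; freq, prediction_length, epochs, num_batches_per_epoch, and the dataset variables are placeholders/assumptions:

```python
import mxnet as mx
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

estimator = DeepAREstimator(
    freq="D",             # assumed daily frequency
    prediction_length=7,  # assumed forecast horizon
    trainer=Trainer(ctx=mx.gpu(0), epochs=1, num_batches_per_epoch=1),
)

# training_data / validation_data are GluonTS datasets built per patient.
train1_output = estimator.train(
    training_data=training_data,
    validation_data=validation_data,
)
```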
Here's the full error trace:
```
  0%|          | 0/1 [00:00<?, ?it/s]
learning rate from "lr_scheduler" has been overwritten by "learning_rate" in optimizer.
  0%|          | 0/1 [00:03<?, ?it/s]
---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-7-33795d1460ee> in <module>
     29     print("\nPatient {} - Amount of Days (Train): {}\n | Amount of Days (Valid): {}\n".format(p_id, train_days, valid_days))
     30
---> 31     train1_output = estimator.train(training_data=training_data, validation_data=validation_data)
     32     #print(agg_metrics)

/usr/local/lib/python3.6/dist-packages/gluonts/mx/model/estimator.py in train(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data, **kwargs)
    205             num_prefetch=num_prefetch,
    206             shuffle_buffer_length=shuffle_buffer_length,
--> 207             cache_data=cache_data,
    208         ).predictor

/usr/local/lib/python3.6/dist-packages/gluonts/mx/model/estimator.py in train_model(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data)
    177             net=training_network,
    178             train_iter=training_data_loader,
--> 179             validation_iter=validation_data_loader,
    180         )
    181

/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py in __call__(self, net, train_iter, validation_iter)
    377                     epoch_no,
    378                     train_iter,
--> 379                     num_batches_to_use=self.num_batches_per_epoch,
    380                 )
    381                 if is_validation_available:

/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py in loop(epoch_no, batch_iter, num_batches_to_use, is_training)
    308                     batch_size = loss.shape[0]
    309
--> 310                     if not np.isfinite(ndarray.sum(loss).asscalar()):
    311                         logger.warning(
    312                             "Batch [%d] of Epoch[%d] gave NaN loss and it will be ignored",

/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py in asscalar(self)
   2583             raise ValueError("The current array is not a scalar")
   2584         if self.ndim == 1:
-> 2585             return self.asnumpy()[0]
   2586         else:
   2587             return self.asnumpy()[()]

/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   2564             self.handle,
   2565             data.ctypes.data_as(ctypes.c_void_p),
-> 2566             ctypes.c_size_t(data.size)))
   2567         return data
   2568

/usr/local/lib/python3.6/dist-packages/mxnet/base.py in check_call(ret)
    244     """
    245     if ret != 0:
--> 246         raise get_last_ffi_error()
    247
    248

MXNetError: Traceback (most recent call last):
  File "../include/mshadow/././././cuda/tensor_gpu-inl.cuh", line 129
Name: Check failed: err == cudaSuccess (209 vs. 0) : MapPlanKernel ErrStr:no kernel image is available for execution on the device
```
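A minimal GPU-only check (my sketch, not from the issue) that exercises the same path as the failing loss.asscalar() call: launching any kernel on the device and copying the result back with asnumpy() should raise the same "no kernel image is available" error in this environment, independent of GluonTS, if the installed MXNet wheel was not built for this GPU's compute capability.

```python
import mxnet as mx

# Create data on the GPU and force a kernel launch plus a device-to-host copy.
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
print((x * 2).asnumpy())  # fails here if no kernel image exists for this GPU
```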