jfrank94 opened a new issue #20363:
URL: https://github.com/apache/incubator-mxnet/issues/20363
I'm running into an error when running the DeepAREstimator from the GluonTS
library. I have installed all of the necessary packages, and when using the
mx.gpu() function, MXNet recognizes that the GPU (NVIDIA, CUDA 11.2) exists on
the system.
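A minimal sketch of that kind of check, assuming MXNet 1.x (these exact calls are my illustration, not copied from the report): querying the device only shows that MXNet can see the GPU, it does not yet launch any CUDA kernel on it.

```python
import mxnet as mx

# How many GPUs the MXNet runtime can see at all.
print("GPUs visible to MXNet:", mx.context.num_gpus())

# Query free/total memory on GPU 0; this talks to the driver but does not
# launch a compute kernel, so it can succeed even if the wheel was built
# without kernels for this GPU architecture.
print("GPU 0 memory (free, total):", mx.context.gpu_memory_info(0))

ctx = mx.gpu(0)  # the context later handed to the trainer
```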
Note that I'm able to run the code fine on Colab without having to run extra
commands such as installing "libquadmath0" or the NCCL library (version 2.8.4
for CUDA 11.2), but when running inside a Docker image those steps are needed,
and the error below still occurs there.
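For context, a rough sketch of the kind of call that fails. Only estimator.train(training_data=..., validation_data=...) and the train1_output name are taken from the traceback below; freq, prediction_length, epochs, num_batches_per_epoch, and the dataset variables are placeholders/assumptions:

```python
import mxnet as mx
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

estimator = DeepAREstimator(
    freq="D",             # assumed daily frequency
    prediction_length=7,  # assumed forecast horizon
    trainer=Trainer(ctx=mx.gpu(0), epochs=1, num_batches_per_epoch=1),
)

# training_data / validation_data are GluonTS datasets built per patient.
train1_output = estimator.train(
    training_data=training_data,
    validation_data=validation_data,
)
```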
Here's the full error trace:
```
  0%|          | 0/1 [00:00<?, ?it/s]
learning rate from "lr_scheduler" has been overwritten by "learning_rate" in optimizer.
  0%|          | 0/1 [00:03<?, ?it/s]
---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-7-33795d1460ee> in <module>
     29     print("\nPatient {} - Amount of Days (Train): {}\n | Amount of Days (Valid): {}\n".format(p_id, train_days, valid_days))
     30
---> 31     train1_output = estimator.train(training_data=training_data, validation_data=validation_data)
     32     #print(agg_metrics)

/usr/local/lib/python3.6/dist-packages/gluonts/mx/model/estimator.py in train(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data, **kwargs)
    205             num_prefetch=num_prefetch,
    206             shuffle_buffer_length=shuffle_buffer_length,
--> 207             cache_data=cache_data,
    208         ).predictor

/usr/local/lib/python3.6/dist-packages/gluonts/mx/model/estimator.py in train_model(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data)
    177             net=training_network,
    178             train_iter=training_data_loader,
--> 179             validation_iter=validation_data_loader,
    180         )
    181

/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py in __call__(self, net, train_iter, validation_iter)
    377                     epoch_no,
    378                     train_iter,
--> 379                     num_batches_to_use=self.num_batches_per_epoch,
    380                 )
    381                 if is_validation_available:

/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py in loop(epoch_no, batch_iter, num_batches_to_use, is_training)
    308                     batch_size = loss.shape[0]
    309
--> 310                     if not np.isfinite(ndarray.sum(loss).asscalar()):
    311                         logger.warning(
    312                             "Batch [%d] of Epoch[%d] gave NaN loss and it will be ignored",

/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py in asscalar(self)
   2583             raise ValueError("The current array is not a scalar")
   2584         if self.ndim == 1:
-> 2585             return self.asnumpy()[0]
   2586         else:
   2587             return self.asnumpy()[()]

/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   2564             self.handle,
   2565             data.ctypes.data_as(ctypes.c_void_p),
-> 2566             ctypes.c_size_t(data.size)))
   2567         return data
   2568

/usr/local/lib/python3.6/dist-packages/mxnet/base.py in check_call(ret)
    244     """
    245     if ret != 0:
--> 246         raise get_last_ffi_error()
    247
    248

MXNetError: Traceback (most recent call last):
  File "../include/mshadow/././././cuda/tensor_gpu-inl.cuh", line 129
Name: Check failed: err == cudaSuccess (209 vs. 0) : MapPlanKernel ErrStr:no kernel image is available for execution on the device
```
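A minimal GPU-only check (my sketch, not from the issue) that exercises the same path as the failing loss.asscalar() call: launching any kernel on the device and copying the result back with asnumpy() should raise the same "no kernel image is available" error in this environment, independent of GluonTS, if the installed MXNet wheel was not built for this GPU's compute capability.

```python
import mxnet as mx

# Create data on the GPU and force a kernel launch plus a device-to-host copy.
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
print((x * 2).asnumpy())  # fails here if no kernel image exists for this GPU
```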