Hello everyone, I'm using MXNet 1.5.1 with CUDA 10.0 and Horovod for 
distributed training. The fc layer in my model is too large, so I am trying to 
apply model parallelism to that layer. The existing code was written by others 
and uses the Module API. I implemented an allreduce CustomOp and inserted it 
into the symbol chain. The op looks like this:



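```python
import mxnet as mx
import horovod.mxnet as hvd

# `env` is a project-specific helper module that is not shown here.


class HorovodAllReduce(mx.operator.CustomOp):
    """Allreduce the input across Horovod workers inside the symbol graph."""

    def __init__(self, average=True, name=None):
        super(HorovodAllReduce, self).__init__()
        # Operator attributes can arrive as strings, so accept 'True' as well.
        self.average = average is True or average == 'True'
        self.name = name
        # Total number of Horovod workers, used to scale the gradient.
        self._num_ranks = hvd.size()

    def forward(self, is_train, req, in_data, out_data, aux):
        x = in_data[0]
        name = self.name if self.name else 'hvd-no-name'
        y = env.hvd_framework().allreduce(x, average=self.average, name=name)
        # Copy the result to the host, then assign it to the output.
        self.assign(out_data[0], req[0], y.asnumpy())

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        grad = out_grad[0]
        if self.average:
            # The forward pass averaged over ranks, so scale the gradient to match.
            grad = grad / self._num_ranks
        self.assign(in_grad[0], req[0], grad)
```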
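
To spell out the arithmetic I expect from the averaging case, here is a small pure-Python sketch. It does not use MXNet or Horovod; `simulate_allreduce_average`, `backward_scale`, and `loss` are hypothetical names for illustration only. It assumes every rank computes the same loss from the same reduced output:

```python
def simulate_allreduce_average(per_rank_inputs):
    """Return the value every rank would hold after an averaging allreduce."""
    num_ranks = len(per_rank_inputs)
    return sum(per_rank_inputs) / num_ranks


def backward_scale(out_grad, num_ranks):
    """Gradient w.r.t. one rank's input when the forward pass averaged."""
    return out_grad / num_ranks


def loss(y):
    # Toy loss, chosen only so the gradient is easy to check by hand.
    return y * y


inputs = [1.0, 2.0, 3.0, 6.0]           # one scalar per simulated rank
y = simulate_allreduce_average(inputs)   # mean of the per-rank inputs
out_grad = 2.0 * y                       # dL/dy for L = y^2
in_grad = backward_scale(out_grad, len(inputs))

# Finite-difference check of the gradient w.r.t. rank 0's input.
eps = 1e-6
bumped = [inputs[0] + eps] + inputs[1:]
numeric = (loss(simulate_allreduce_average(bumped)) - loss(y)) / eps
print(in_grad, numeric)
```

The two printed values should agree closely, which is why the scale factor has to be the number of ranks rather than a rank index.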


When constructing the Module, I pass the GPU device as the value of the 
`group2ctxs` argument, then use `with AttrScope` to place the symbols on that 
GPU. However, when I change the last line of `forward()` to 
`self.assign(out_data[0], req[0], mx.nd.array(y))`, I get an error from 
Horovod: `what():  cudaEventSynchronize failed: an illegal memory access was encountered`.
I think this error comes from here: 
https://github.com/horovod/horovod/blob/v0.18.2/horovod/common/ops/cuda_operations.cc#L87

By the way, is there a way to tell whether the data for each symbol lives in 
GPU memory when using the Module API? In Gluon this is easy, but with the 
Module API I cannot find a way to do it. Even though I pass the GPU context 
when constructing the Module, NVIDIA Nsight Systems profiling shows that my 
softmax + fc part is still not computed on the GPU.




