ZheyuYe opened a new issue #18766:
URL: https://github.com/apache/incubator-mxnet/issues/18766


   ## Description
   I tried to run Horovod on the latest MXNet master branch, and it failed with the error shown below.
   
   ### Steps to reproduce
   Horovod was installed with `HOROVOD_GPU_OPERATIONS=NCCL pip3 install --no-cache-dir horovod`.
   
   [A gist example](https://gist.github.com/ZheyuYe/f59b33c20a0b6fdcbf471af3e2d5ac64), adapted from the [official MNIST example](https://github.com/apache/incubator-mxnet/blob/master/example/distributed_training-horovod/gluon_mnist.py), was executed with `mpirun -np 4 -H localhost:4 -bind-to none -map-by slot python3 mxnet_mnist.py`.
   
   Please point out any steps I missed if I didn't run this example properly.
   ## Outputs
   ```bash
     File "/home/ubuntu/.local/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 154, in broadcast_parameters
       broadcast_(tensor, root_rank, name=str(name))
     File "/home/ubuntu/.local/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py", line 232, in broadcast_
       c_in = tensor.handle
   AttributeError: 'Parameter' object has no attribute 'handle'
   Traceback (most recent call last):
     File "gluon_mnist.py", line 144, in <module>
       hvd.broadcast_parameters(params, root_rank=0)
     File "/home/ubuntu/.local/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 154, in broadcast_parameters
       broadcast_(tensor, root_rank, name=str(name))
     File "/home/ubuntu/.local/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py", line 232, in broadcast_
       c_in = tensor.handle
   AttributeError: 'Parameter' object has no attribute 'handle'
   --------------------------------------------------------------------------
   Primary job  terminated normally, but 1 process returned
   a non-zero exit code. Per user-direction, the job has been aborted.
   --------------------------------------------------------------------------
   --------------------------------------------------------------------------
   mpirun detected that one or more processes exited with non-zero status, thus causing
   the job to be terminated. The first process to do so was:
   
     Process name: [[28180,1],1]
     Exit code:    1
   --------------------------------------------------------------------------
   ```
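
The traceback suggests that `broadcast_` reads `tensor.handle`, while the values reaching it are `Parameter` objects rather than arrays. Below is a minimal sketch of that mismatch. It assumes nothing about the real libraries: `FakeNDArray`, `FakeParameter`, `missing_handle`, and `unwrap` are hypothetical stand-ins written for illustration, not MXNet or Horovod APIs. Whether unwrapping each parameter to its array before broadcasting is the right fix for this issue is untested here.

```python
# Stand-ins for illustration only; they are NOT the real MXNet classes.
class FakeNDArray:
    """Mimics an array type that exposes a C handle."""

    def __init__(self, values):
        self.values = values
        self.handle = id(self)  # placeholder for the underlying C pointer


class FakeParameter:
    """Mimics a parameter wrapper that holds an array but has no `handle`."""

    def __init__(self, values):
        self._array = FakeNDArray(values)

    def data(self):
        return self._array


def missing_handle(params):
    """Return the names of entries on which a `.handle` lookup would fail."""
    return [name for name, value in params.items() if not hasattr(value, "handle")]


def unwrap(params):
    """Replace handle-less wrappers with the array returned by their `data()`."""
    return {
        name: value if hasattr(value, "handle") else value.data()
        for name, value in params.items()
    }


params = {
    "dense0.weight": FakeParameter([1.0, 2.0]),
    "dense0.bias": FakeNDArray([0.0]),
}
print(missing_handle(params))          # names that would raise AttributeError
print(missing_handle(unwrap(params)))  # empty once every value is an array
```

Running a check like `missing_handle` on the dictionary passed to `hvd.broadcast_parameters` might help confirm which entries trigger the `AttributeError` before the job is launched under `mpirun`.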
   
   ## Comments
   @leezu  @sxjscience 
   
   



