chichan01 opened a new issue #7614: Error in ---kv-store dist_***
URL: https://github.com/apache/incubator-mxnet/issues/7614
 
 
   For bugs or installation issues, please provide the following information.
   The more information you provide, the more likely people will be able to 
help you.
   
   ## Environment info
   Operating System: Ubuntu 16.04.3 LTS
   
   Compiler: gcc version 5.4.0 20160609 
   
   Package used (Python/R/Scala/Julia): python
   
   MXNet version: '0.11.1'
   
   Or if installed from source: git clone --recursive https://github.com/dmlc/mxnet
   
   MXNet commit hash (`git rev-parse HEAD`): 491f81e
   
   If you are using python package, please provide
   
   Python version and distribution: python 2.7
   
   If you are using R package, please provide
   
   R `sessionInfo()`:
   
   ## Error Message:
   Both of these commands run fine:
   python ../launch.py -H ../hosts -n 1 python measure.py --kv-store local
   python ../launch.py -H ../hosts -n 1 python measure.py --kv-store device
   With --kv-store dist_sync or dist_async, however, I get the errors below. 
Any suggestions?
   
   Please paste the full error message, including stack trace.
   cc0011@thorin:../mxnet/tools/bandwidth$ python ../launch.py -H ../hosts -n 1 
python measure.py --kv-store dist_sync
   Ubuntu 16.04.3 LTS
   Ubuntu 16.04.3 LTS
   
/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/opt/lib:/usr/local/db4.8/lib
   export: Command not found.
   export: Command not found.
   export: Command not found.
   export: Command not found.
   export: Command not found.
   export: Command not found.
   
/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/opt/lib:/usr/local/db4.8/lib
   export: Command not found.
   export: Command not found.
   export: Command not found.
   export: Command not found.
   export: Command not found.
   export: Command not found.
   INFO:root:Namespace(disp_batches=1, gpus='0,1', image_shape='3,224,224', 
kv_store='dist_sync', network='resnet', num_batches=5, num_classes=1000, 
num_layers=152, optimizer='None', test_results=1)
   [10:30:56] ../mxnet/dmlc-core/include/dmlc/./logging.h:308: [10:30:56] 
src/postoffice.cc:16: Check  notnull: 
Environment::Get()->find("DMLC_NUM_WORKER") 
   
   Stack trace returned 10 entries:
   [bt] (0) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) 
[0x7fc7cb27a25c]
   [bt] (1) 
./mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) 
[0x7fc7cc58a9de]
   [bt] (2) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3)
 [0x7fc7cc583183]
   [bt] (3) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) 
[0x7fc7cc4d5470]
   [bt] (4) ../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) 
[0x7fc7cc480449]
   [bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) 
[0x7fc7fbbb8e40]
   [bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) 
[0x7fc7fbbb88ab]
   [bt] (7) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f)
 [0x7fc7fbdc83df]
   [bt] (8) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) 
[0x7fc7fbdccd82]
   [bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]
   
   Traceback (most recent call last):
     File "measure.py", line 144, in <module>
       run(**vars(args))
     File "measure.py", line 79, in run
       kv = mx.kv.create(kv_store)
     File "../mxnet/tools/bandwidth/../../python/mxnet/kvstore.py", line 513, 
in create
       ctypes.byref(handle)))
     File "../mxnet/tools/bandwidth/../../python/mxnet/base.py", line 143, in 
check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [10:30:56] src/postoffice.cc:16: Check  notnull: 
Environment::Get()->find("DMLC_NUM_WORKER") 
   
   Stack trace returned 10 entries:
   [bt] (0) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) 
[0x7fc7cb27a25c]
   [bt] (1) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) 
[0x7fc7cc58a9de]
   [bt] (2) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3)
 [0x7fc7cc583183]
   [bt] (3) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) 
[0x7fc7cc4d5470]
   [bt] (4) ../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) 
[0x7fc7cc480449]
   [bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) 
[0x7fc7fbbb8e40]
   [bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) 
[0x7fc7fbbb88ab]
   [bt] (7) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f)
 [0x7fc7fbdc83df]
   [bt] (8) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) 
[0x7fc7fbdccd82]
   [bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]
   
   Exception in thread Thread-3:
   Traceback (most recent call last):
     File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
       self.run()
     File "/usr/lib/python2.7/threading.py", line 754, in run
       self.__target(*self.__args, **self.__kwargs)
     File "../mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, 
in run
       subprocess.check_call(prog, shell = True)
     File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
       raise CalledProcessError(retcode, cmd)
   CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 131.227.80.54 
-p 22 'export 
LD_LIBRARY_PATH=/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/usr/local/cuda/lib64:/usr/lib//x86_64-linux-gnu/:/opt/lib:/usr/local/db4.8/lib;
 export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9125; export 
DMLC_PS_ROOT_URI=131.227.80.54; export DMLC_NUM_SERVER=1; export 
DMLC_NUM_WORKER=1; cd ../mxnet/tools/bandwidth/; python measure.py --kv-store 
dist_sync'' returned non-zero exit status 1
   
   INFO:root:Namespace(disp_batches=1, gpus='0,1', image_shape='3,224,224', 
kv_store='dist_sync', network='resnet', num_batches=5, num_classes=1000, 
num_layers=152, optimizer='None', test_results=1)
   [10:30:56] ../mxnet/dmlc-core/include/dmlc/./logging.h:308: [10:30:56] 
src/postoffice.cc:16: Check  notnull: 
Environment::Get()->find("DMLC_NUM_WORKER") 
   
   Stack trace returned 10 entries:
   [bt] (0) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) 
[0x7fa9c2b0c25c]
   [bt] 
(1)../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) 
[0x7fa9c3e1c9de]
   [bt] (2) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3)
 [0x7fa9c3e15183]
   [bt] (3) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) 
[0x7fa9c3d67470]
   [bt] (4) ../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) 
[0x7fa9c3d12449]
   [bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) 
[0x7fa9f3450e40]
   [bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) 
[0x7fa9f34508ab]
   [bt] (7) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f)
 [0x7fa9f36603df]
   [bt] (8) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) 
[0x7fa9f3664d82]
   [bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]
   
   Traceback (most recent call last):
     File "measure.py", line 144, in <module>
       run(**vars(args))
     File "measure.py", line 79, in run
       kv = mx.kv.create(kv_store)
     File "../mxnet/tools/bandwidth/../../python/mxnet/kvstore.py", line 513, 
in create
       ctypes.byref(handle)))
     File "../mxnet/tools/bandwidth/../../python/mxnet/base.py", line 143, in 
check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [10:30:56] src/postoffice.cc:16: Check  notnull: 
Environment::Get()->find("DMLC_NUM_WORKER") 
   
   Stack trace returned 10 entries:
   [bt] (0) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) 
[0x7fa9c2b0c25c]
   [bt] (1) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) 
[0x7fa9c3e1c9de]
   [bt] (2) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3)
 [0x7fa9c3e15183]
   [bt] (3) 
../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) 
[0x7fa9c3d67470]
   [bt] (4)../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) 
[0x7fa9c3d12449]
   [bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) 
[0x7fa9f3450e40]
   [bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) 
[0x7fa9f34508ab]
   [bt] (7) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f)
 [0x7fa9f36603df]
   [bt] (8) 
/usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) 
[0x7fa9f3664d82]
   [bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]
   
   Exception in thread Thread-2:
   Traceback (most recent call last):
     File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
       self.run()
     File "/usr/lib/python2.7/threading.py", line 754, in run
       self.__target(*self.__args, **self.__kwargs)
     File 
"/vol/vssp/project/Chan/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", 
line 60, in run
       subprocess.check_call(prog, shell = True)
     File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
       raise CalledProcessError(retcode, cmd)
   CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 131.227.80.58 
-p 22 'export 
LD_LIBRARY_PATH=/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/usr/local/cuda/lib64:/opt/Papillon/lib/:/vol/vssp/project/Chan/caffe-face/build/lib:/usr/lib//x86_64-linux-gnu/:/opt/lib:/usr/local/db4.8/lib;
 export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9125; export 
DMLC_PS_ROOT_URI=131.227.80.54; export DMLC_NUM_SERVER=1; export 
DMLC_NUM_WORKER=1; cd /vol/vssp/project/Chan/mxnet/tools/bandwidth/; python 
measure.py --kv-store dist_sync'' returned non-zero exit status 1
   
   ^C2017-08-25 10:31:01,457 INFO Stop launcher
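
A minimal sketch for narrowing this down, under assumptions the report does not confirm. The repeated `export: Command not found.` lines suggest the remote login shell may be csh/tcsh rather than an sh-compatible shell. If so, the `export DMLC_*=...` statements in the ssh command would have no effect, and `DMLC_NUM_WORKER` would be unset when `mx.kv.create` runs. The helper name `missing_dmlc_vars` is hypothetical. The variable names are copied from the ssh command in the log above.

```python
import os

# Variables the launcher tries to export in the ssh command shown in the log.
REQUIRED_DMLC_VARS = (
    "DMLC_ROLE",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
)


def missing_dmlc_vars(environ=None):
    """Return the launcher variables that are absent from the environment.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_DMLC_VARS if name not in environ]


if __name__ == "__main__":
    missing = missing_dmlc_vars()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all DMLC_* variables are set")
```

Running this script through the same launcher in place of measure.py would show which variables actually reach the remote process. If all five are reported missing, the exports are probably being rejected by the remote shell, and the cause lies outside MXNet itself.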
   
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
