chichan01 opened a new issue #7614: Error in ---kv-store dist_***
URL: https://github.com/apache/incubator-mxnet/issues/7614

For bugs or installation issues, please provide the following information.
The more information you provide, the more likely people will be able to help you.

## Environment info
Operating System: Ubuntu 16.04.3 LTS

Compiler: gcc version 5.4.0 20160609

Package used (Python/R/Scala/Julia): python

MXNet version: '0.11.1'

Or if installed from source: git clone --recursive https://github.com/dmlc/mxnet

MXNet commit hash (`git rev-parse HEAD`): 491f81e

If you are using python package, please provide Python version and distribution: python 2.7

If you are using R package, please provide R `sessionInfo()`:

## Error Message:
Both of the following commands work fine:

python ../launch.py -H ../hosts -n 1 python measure.py --kv-store local

python ../launch.py -H ../hosts -n 1 python measure.py --kv-store device

But when I use dist_sync or dist_async, I get the errors below. Any suggestions?

Please paste the full error message, including stack trace.

cc0011@thorin:../mxnet/tools/bandwidth$ python ../launch.py -H ../hosts -n 1 python measure.py --kv-store dist_sync
Ubuntu 16.04.3 LTS
Ubuntu 16.04.3 LTS
/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/opt/lib:/usr/local/db4.8/lib
export: Command not found.
export: Command not found.
export: Command not found.
export: Command not found.
export: Command not found.
export: Command not found.
/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/opt/lib:/usr/local/db4.8/lib
export: Command not found.
export: Command not found.
export: Command not found.
export: Command not found.
export: Command not found.
export: Command not found.
INFO:root:Namespace(disp_batches=1, gpus='0,1', image_shape='3,224,224', kv_store='dist_sync', network='resnet', num_batches=5, num_classes=1000, num_layers=152, optimizer='None', test_results=1)
[10:30:56] ../mxnet/dmlc-core/include/dmlc/./logging.h:308: [10:30:56] src/postoffice.cc:16: Check notnull: Environment::Get()->find("DMLC_NUM_WORKER")

Stack trace returned 10 entries:
[bt] (0) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fc7cb27a25c]
[bt] (1) ./mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) [0x7fc7cc58a9de]
[bt] (2) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3) [0x7fc7cc583183]
[bt] (3) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) [0x7fc7cc4d5470]
[bt] (4) ../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) [0x7fc7cc480449]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc7fbbb8e40]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc7fbbb88ab]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fc7fbdc83df]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7fc7fbdccd82]
[bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]

Traceback (most recent call last):
  File "measure.py", line 144, in <module>
    run(**vars(args))
  File "measure.py", line 79, in run
    kv = mx.kv.create(kv_store)
  File "../mxnet/tools/bandwidth/../../python/mxnet/kvstore.py", line 513, in create
    ctypes.byref(handle)))
  File "../mxnet/tools/bandwidth/../../python/mxnet/base.py", line 143, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:30:56] src/postoffice.cc:16: Check notnull: Environment::Get()->find("DMLC_NUM_WORKER")

Stack trace returned 10 entries:
[bt] (0) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fc7cb27a25c]
[bt] (1) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) [0x7fc7cc58a9de]
[bt] (2) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3) [0x7fc7cc583183]
[bt] (3) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) [0x7fc7cc4d5470]
[bt] (4) ../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) [0x7fc7cc480449]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc7fbbb8e40]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc7fbbb88ab]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fc7fbdc83df]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7fc7fbdccd82]
[bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "../mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 131.227.80.54 -p 22 'export LD_LIBRARY_PATH=/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/usr/local/cuda/lib64:/usr/lib//x86_64-linux-gnu/:/opt/lib:/usr/local/db4.8/lib; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9125; export DMLC_PS_ROOT_URI=131.227.80.54; export DMLC_NUM_SERVER=1; export DMLC_NUM_WORKER=1; cd ../mxnet/tools/bandwidth/; python measure.py --kv-store dist_sync'' returned non-zero exit status 1

INFO:root:Namespace(disp_batches=1, gpus='0,1', image_shape='3,224,224', kv_store='dist_sync', network='resnet', num_batches=5, num_classes=1000, num_layers=152, optimizer='None', test_results=1)
[10:30:56] ../mxnet/dmlc-core/include/dmlc/./logging.h:308: [10:30:56] src/postoffice.cc:16: Check notnull: Environment::Get()->find("DMLC_NUM_WORKER")

Stack trace returned 10 entries:
[bt] (0) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fa9c2b0c25c]
[bt] (1) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) [0x7fa9c3e1c9de]
[bt] (2) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3) [0x7fa9c3e15183]
[bt] (3) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) [0x7fa9c3d67470]
[bt] (4) ../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) [0x7fa9c3d12449]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fa9f3450e40]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fa9f34508ab]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fa9f36603df]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7fa9f3664d82]
[bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]

Traceback (most recent call last):
  File "measure.py", line 144, in <module>
    run(**vars(args))
  File "measure.py", line 79, in run
    kv = mx.kv.create(kv_store)
  File "../mxnet/tools/bandwidth/../../python/mxnet/kvstore.py", line 513, in create
    ctypes.byref(handle)))
  File "../mxnet/tools/bandwidth/../../python/mxnet/base.py", line 143, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:30:56] src/postoffice.cc:16: Check notnull: Environment::Get()->find("DMLC_NUM_WORKER")

Stack trace returned 10 entries:
[bt] (0) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fa9c2b0c25c]
[bt] (1) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps10PostofficeC1Ev+0x1dfe) [0x7fa9c3e1c9de]
[bt] (2) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN2ps8CustomerC2EiRKSt8functionIFvRKNS_7MessageEEE+0x8f3) [0x7fa9c3e15183]
[bt] (3) ../mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5b0) [0x7fa9c3d67470]
[bt] (4) ../mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x9) [0x7fa9c3d12449]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fa9f3450e40]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fa9f34508ab]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fa9f36603df]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7fa9f3664d82]
[bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/vol/vssp/project/Chan/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 131.227.80.58 -p 22 'export LD_LIBRARY_PATH=/vol/vssp/deepface/cuda/lib64/:/vol/vssp/deepface/nccl/build/lib/:/usr/local/cuda/lib64:/opt/Papillon/lib/:/vol/vssp/project/Chan/caffe-face/build/lib:/usr/lib//x86_64-linux-gnu/:/opt/lib:/usr/local/db4.8/lib; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9125; export DMLC_PS_ROOT_URI=131.227.80.54; export DMLC_NUM_SERVER=1; export DMLC_NUM_WORKER=1; cd /vol/vssp/project/Chan/mxnet/tools/bandwidth/; python measure.py --kv-store dist_sync'' returned non-zero exit status 1
^C2017-08-25 10:31:01,457 INFO Stop launcher

## Minimum reproducible example
if you are using your own code, please provide a short script that reproduces the error.
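The failing check in the log is `Check notnull: Environment::Get()->find("DMLC_NUM_WORKER")`, which suggests the launched process did not see that variable, even though the `ssh` command shown in the log tries to export it. The repeated `export: Command not found.` lines may mean the remote login shell does not provide `export` (csh-family shells, for instance, use `setenv`), in which case none of the variables would be set; this is an assumption, not something the report confirms. A minimal sketch for checking what the remote process actually receives is below. The names `REQUIRED_DMLC_VARS` and `missing_dmlc_vars` are hypothetical helpers, and the variable list is taken only from the `ssh` command in the log, so it may not match everything the parameter server reads.

```python
import os

# Variables the launcher tries to export in the ssh command shown in the log.
# Assumption: this list comes from the log only, not from the ps-lite sources.
REQUIRED_DMLC_VARS = (
    "DMLC_ROLE",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
)


def missing_dmlc_vars(env=None):
    """Return the names from REQUIRED_DMLC_VARS that are absent or empty in env."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_DMLC_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_dmlc_vars()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all DMLC_* variables from the log are set")
```

Saved under a hypothetical name such as `check_env.py` next to `measure.py`, the script could be run through the same launcher command in place of `measure.py`. If it reports missing variables on the remote hosts, the problem would lie in how the environment is passed over `ssh` rather than in the kvstore itself.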
## Steps to reproduce
or if you are running standard examples, please provide the commands you have run that lead to the error.

1.
2.
3.

## What have you tried to solve it?

1.
2.
3.