Dear EasyBuilders,
I'm trying to build PyTorch-1.7.1-fosscuda-2020b.eb on a CentOS 7 server
with some NVIDIA GPUs, and the build fails during the test step after
about 2 hours:
$ eb PyTorch-1.7.1-fosscuda-2020b.eb -r
== Temporary log file in case of crash /tmp/eb-zAAAvr/easybuild-TDNRVQ.log
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== resolving dependencies ...
== processing EasyBuild easyconfig /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb
== building and installing PyTorch/1.7.1-fosscuda-2020b...
== fetching files...
== creating build dir, resetting environment...
== unpacking...
== patching...
== preparing...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/PyTorch/1.7.1/fosscuda-2020b): build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou (took 1 hour 59 min 46 sec)
== Results of the build can be found in the log file(s) /tmp/eb-zAAAvr/easybuild-PyTorch-1.7.1-20210601.074610.WfkGf.log
ERROR: Build of /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou')
The EasyBuild log file ends with these 4 errors:
======================================================================
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
======================================================================
ERROR: test_DistributedDataParallel_with_grad_is_view (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
----------------------------------------------------------------------
Ran 134 tests in 286.115s
FAILED (errors=4, skipped=91)
Traceback (most recent call last):
  File "run_test.py", line 745, in <module>
    main()
  File "run_test.py", line 728, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_distributed_fork failed!
(at easybuild/tools/run.py:537 in parse_cmd_output)
== 2021-06-01 09:45:57,406 filetools.py:1810 INFO Removing lock /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock...
== 2021-06-01 09:45:57,407 filetools.py:347 INFO Path /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock successfully removed.
== 2021-06-01 09:45:57,407 filetools.py:1814 INFO Lock removed: /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock
== 2021-06-01 09:45:57,407 easyblock.py:3414 WARNING build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou
== 2021-06-01 09:45:57,407 easyblock.py:298 INFO Closing log for application name PyTorch version 1.7.1
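All four errors above come from the same suite (distributed/test_distributed_fork), so it may help to list the failing tests straight from the log file. What follows is only a minimal sketch and not part of EasyBuild or PyTorch: the names `failed_tests` and `FAILURE_RE` are hypothetical, and it assumes the log contains standard unittest "ERROR: test (class)" headers like the ones shown above.

```python
import re

# unittest reports each failure as "ERROR: test_name (module.Class)" or
# "FAIL: test_name (module.Class)". Using \s+ between the parts also
# tolerates headers that were wrapped over several lines when pasted.
FAILURE_RE = re.compile(r"^(ERROR|FAIL):\s+(\S+)\s+\(([^)\s]+)\)", re.MULTILINE)


def failed_tests(log_text):
    """Return unique (kind, test, test_class) tuples in first-seen order."""
    seen = []
    for match in FAILURE_RE.finditer(log_text):
        entry = match.groups()
        if entry not in seen:
            seen.append(entry)
    return seen
```

Calling `failed_tests(open(path).read())` on the log file named in the output above should include the four TestDistBackendWithFork tests, while lines such as "ERROR: Build of ..." are not matched because no parenthesised class name follows the second word.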
Question: Does anyone know how to fix these errors?
Thanks,
Ole