Hi Kenneth,

I can confirm that with EasyBuild v4.4.0 the PyTorch 1.8.1 installation went smoothly:

$ eb PyTorch-1.8.1-foss-2020b.eb -r

Best regards,
Ole

On 6/1/21 5:13 PM, Kenneth Hoste wrote:
Hi Ole,

This error doesn't mean anything in particular to me, but perhaps it rings a bell for Alexander (in CC).

There are a couple of fixes related to PyTorch that will be included in the upcoming EasyBuild v4.4.0 release (which will be released tomorrow hopefully), so keep an eye out for that...


regards,

Kenneth


On 01/06/2021 09:56, Ole Holm Nielsen wrote:
Dear EasyBuilders,

I'm trying to build PyTorch-1.7.1-fosscuda-2020b.eb on a CentOS 7 server with some Nvidia GPUs, and the build fails in the tests after about 2 hours:

$ eb PyTorch-1.7.1-fosscuda-2020b.eb -r
== Temporary log file in case of crash /tmp/eb-zAAAvr/easybuild-TDNRVQ.log
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== resolving dependencies ...
== processing EasyBuild easyconfig /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb
== building and installing PyTorch/1.7.1-fosscuda-2020b...
== fetching files...
== creating build dir, resetting environment...
== unpacking...
== patching...
== preparing...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/PyTorch/1.7.1/fosscuda-2020b): build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou (took 1 hour 59 min 46 sec)
== Results of the build can be found in the log file(s) /tmp/eb-zAAAvr/easybuild-PyTorch-1.7.1-20210601.074610.WfkGf.log
ERROR: Build of /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou')


The EB log file shows these 4 errors at the end of the file:

======================================================================
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_with_grad_is_view (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

----------------------------------------------------------------------
Ran 134 tests in 286.115s

FAILED (errors=4, skipped=91)
Traceback (most recent call last):
   File "run_test.py", line 745, in <module>
     main()
   File "run_test.py", line 728, in main
     raise RuntimeError(err_message)
RuntimeError: distributed/test_distributed_fork failed!
  (at easybuild/tools/run.py:537 in parse_cmd_output)
== 2021-06-01 09:45:57,406 filetools.py:1810 INFO Removing lock /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock...
== 2021-06-01 09:45:57,407 filetools.py:347 INFO Path /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock successfully removed.
== 2021-06-01 09:45:57,407 filetools.py:1814 INFO Lock removed: /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock
== 2021-06-01 09:45:57,407 easyblock.py:3414 WARNING build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou
== 2021-06-01 09:45:57,407 easyblock.py:298 INFO Closing log for application name PyTorch version 1.7.1
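
As an aside, failures like the four above can be pulled out of a long build log with a short script rather than by scrolling. The following is only a minimal sketch: it assumes the log contains standard Python unittest headers of the form "ERROR: test_name (module.Class)" followed later by an unindented "SomethingError: message" line. The function name summarize_failures and the sample text are hypothetical and not part of EasyBuild or PyTorch.

```python
import re

# Header printed by unittest for each failing test, e.g.
# "ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)"
HEADER = re.compile(r"^(ERROR|FAIL): (\S+) \((\S+)\)\s*$")
# Final, unindented line of a traceback, e.g. "RuntimeError: ..."
EXC = re.compile(r"^\w+(Error|Exception): ")


def summarize_failures(log_text):
    """Return (kind, test_name, test_class, message) for each failing test."""
    failures = []
    pending = None  # header seen, still waiting for its exception line
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            pending = header.groups()
        elif pending and EXC.match(line):
            failures.append(pending + (line.strip(),))
            pending = None
    return failures


# Hypothetical sample, shaped like the excerpt above.
sample = """\
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
"""

for kind, name, cls, message in summarize_failures(sample):
    print(f"{kind} {cls}.{name}: {message}")
```

To use it on a real log, read the file into a string first (for example with open(path).read()) and pass that to summarize_failures.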


Question: Does anyone know how to fix these errors?

Thanks,
Ole


--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
