Hi Kenneth,

Thanks, but PR 12814 may still have an issue with testing network ports, see https://github.com/easybuilders/easybuild-easyconfigs/pull/12814

The log file shows an error about a TCP port already being in use. On CentOS 7.9 I always see network ports remaining in use right after I shut down FlexLM license servers, but after a couple of minutes they are free again. Is there a configurable TCP port timeout in CentOS/RHEL 7? I think my system uses the default values.
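
For reference, here is a minimal Python sketch of how one could check whether a given TCP port can be bound at the moment, and how to ask the kernel for an unused one instead of hard-coding a number. The helper names port_is_free and pick_free_port are made up for illustration, and this is not how the PyTorch test itself chooses its port. A port that stays busy for a while after a server has exited often points at connections lingering in the TIME_WAIT state, but that is only a guess here.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # No SO_REUSEADDR on purpose: we want the bind to fail
            # whenever the kernel still considers the port occupied.
            sock.bind((host, port))
        except OSError:
            return False
        return True


def pick_free_port(host="127.0.0.1"):
    """Let the kernel choose an unused ephemeral port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    candidate = pick_free_port()
    print(candidate, port_is_free(candidate))
```

Note that a port reported as free can still be taken by another process before it is actually used, so this is a diagnostic aid rather than a guarantee.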

Log file:

(lines deleted)
Running distributed/test_jit_c10d ... [2021-06-03 11:45:18.408449]
Executing ['/home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python', 'distributed/test_jit_c10d.py', '-v'] ... [2021-06-03 11:45:18.408656]
test_frontend_singleton (__main__.C10dFrontendJitTest) ... ok
test_process_group_as_module_member (__main__.C10dProcessGroupSerialization) ... ERROR
test_init_process_group_nccl_as_base_process_group_torchbind (__main__.ProcessGroupNCCLJitTest) ... ok
test_init_process_group_nccl_torchbind (__main__.ProcessGroupNCCLJitTest) ... ok
test_process_group_nccl_as_base_process_group_torchbind_alltoall (__main__.ProcessGroupNCCLJitTest) ... ok
test_process_group_nccl_serialization (__main__.ProcessGroupNCCLJitTest) ... ok
test_process_group_nccl_torchbind_alltoall (__main__.ProcessGroupNCCLJitTest) ... ok
test_create_file_store (__main__.StoreTest) ... ok
test_create_prefix_store (__main__.StoreTest) ... ok

======================================================================
ERROR: test_process_group_as_module_member (__main__.C10dProcessGroupSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-3tUIrb/tmpFJsxiC/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 146, in wrapper
    return func(*args, **kwargs)
  File "distributed/test_jit_c10d.py", line 228, in test_process_group_as_module_member
    self.checkModule(TestModule(), (torch.rand((2, 3)),))
  File "distributed/test_jit_c10d.py", line 216, in __init__
    tcp_store = _create_tcp_store()
  File "distributed/test_jit_c10d.py", line 40, in _create_tcp_store
    return torch.classes.dist_c10d.TCPStore(addr, port, 1, True, timeout_millisecond)
RuntimeError: Address already in use

----------------------------------------------------------------------
Ran 9 tests in 0.410s

FAILED (errors=1)
Traceback (most recent call last):
  File "run_test.py", line 926, in <module>
    main()
  File "run_test.py", line 905, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_jit_c10d failed!
 (at easybuild/tools/run.py:565 in parse_cmd_output)
== 2021-06-03 11:45:23,800 filetools.py:1876 INFO Removing lock /home/modules/software/.locks/_home_modules_software_PyTorch_1.8.1-fosscuda-2020b.lock...
== 2021-06-03 11:45:23,802 filetools.py:358 INFO Path /home/modules/software/.locks/_home_modules_software_PyTorch_1.8.1-fosscuda-2020b.lock successfully removed.
== 2021-06-03 11:45:23,802 filetools.py:1880 INFO Lock removed: /home/modules/software/.locks/_home_modules_software_PyTorch_1.8.1-fosscuda-2020b.lock
== 2021-06-03 11:45:23,802 easyblock.py:3643 WARNING build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-3tUIrb/tmpFJsxiC/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou
== 2021-06-03 11:45:23,803 easyblock.py:300 INFO Closing log for application name PyTorch version 1.8.1

/Ole

On 6/4/21 12:03 PM, Kenneth Hoste wrote:
See https://github.com/easybuilders/easybuild-easyconfigs/pull/12814

It needs a bit more love though (CI tests are failing currently), but you can try installing that using "eb --from-pr 12814".


regards,

Kenneth

On 03/06/2021 10:17, Alexander Grund wrote:
There is an open PR in the easyconfigs repo. Check that :)

Am 03.06.21 um 10:16 schrieb Ole Holm Nielsen:
Our users report an error with this PyTorch module:

AssertionError: Torch not compiled with CUDA enabled

Are there any plans to make a PyTorch-1.8.1 module with the fosscuda toolchain?
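
As an illustration, a quick check like the one below could confirm whether the PyTorch picked up by a loaded module was built with CUDA at all. It is only a sketch: the function name describe_torch_build is invented here, and it relies on torch.version.cuda being None for CPU-only builds.

```python
def describe_torch_build():
    """Summarise whether the importable PyTorch was built with CUDA support."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not importable in this environment"

    cuda_version = torch.version.cuda  # None for CPU-only builds
    if cuda_version is None:
        return f"PyTorch {torch.__version__} was built without CUDA"

    # A CUDA-enabled build can still lack a usable GPU or driver on this host.
    state = "usable" if torch.cuda.is_available() else "not usable on this host"
    return f"PyTorch {torch.__version__} built against CUDA {cuda_version}, GPU {state}"


if __name__ == "__main__":
    print(describe_torch_build())
```

A "built without CUDA" result would be consistent with the AssertionError above for a module built with the foss toolchain rather than fosscuda.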

Thanks,
Ole


On 6/3/21 8:54 AM, Kenneth Hoste wrote:
Excellent, that's great to hear, thanks for the update Ole!


regards,

Kenneth

On 03/06/2021 07:42, Ole Holm Nielsen wrote:
Hi Kenneth,

I can confirm that with EasyBuild v4.4.0 the PyTorch 1.8.1 installation went smoothly and without any problems:

$ eb PyTorch-1.8.1-foss-2020b.eb -r

Best regards,
Ole

On 6/1/21 5:13 PM, Kenneth Hoste wrote:
Hi Ole,

This error doesn't mean anything in particular to me, but perhaps it rings a bell for Alexander (in CC).

There are a couple of fixes related to PyTorch that will be included in the upcoming EasyBuild v4.4.0 release (which will be released tomorrow hopefully), so keep an eye out for that...


regards,

Kenneth


On 01/06/2021 09:56, Ole Holm Nielsen wrote:
Dear EasyBuilders,

I'm trying to build PyTorch-1.7.1-fosscuda-2020b.eb on a CentOS 7 server with some Nvidia GPUs, and the build fails in the tests after about 2 hours:

$ eb PyTorch-1.7.1-fosscuda-2020b.eb -r
== Temporary log file in case of crash /tmp/eb-zAAAvr/easybuild-TDNRVQ.log
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== resolving dependencies ...
== processing EasyBuild easyconfig /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb
== building and installing PyTorch/1.7.1-fosscuda-2020b...
== fetching files...
== creating build dir, resetting environment...
== unpacking...
== patching...
== preparing...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/PyTorch/1.7.1/fosscuda-2020b): build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou (took 1 hour 59 min 46 sec)
== Results of the build can be found in the log file(s) /tmp/eb-zAAAvr/easybuild-PyTorch-1.7.1-20210601.074610.WfkGf.log
ERROR: Build of /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou')


The EB log file shows these 4 errors at the end of the file:

======================================================================
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_with_grad_is_view (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
     self._join_processes(fn)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
     self._check_return_codes(elapsed_time)
   File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
     raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

----------------------------------------------------------------------
Ran 134 tests in 286.115s

FAILED (errors=4, skipped=91)
Traceback (most recent call last):
   File "run_test.py", line 745, in <module>
     main()
   File "run_test.py", line 728, in main
     raise RuntimeError(err_message)
RuntimeError: distributed/test_distributed_fork failed!
  (at easybuild/tools/run.py:537 in parse_cmd_output)
== 2021-06-01 09:45:57,406 filetools.py:1810 INFO Removing lock /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock...
== 2021-06-01 09:45:57,407 filetools.py:347 INFO Path /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock successfully removed.
== 2021-06-01 09:45:57,407 filetools.py:1814 INFO Lock removed: /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock
== 2021-06-01 09:45:57,407 easyblock.py:3414 WARNING build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou
== 2021-06-01 09:45:57,407 easyblock.py:298 INFO Closing log for application name PyTorch version 1.7.1
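
Since these logs are long, a small Python sketch along the following lines could pull the failing test names out of the unittest output. The regular expression and the function name failed_tests are assumptions made for illustration; they only match the "ERROR: name (class)" and "FAIL: name (class)" header lines that unittest prints, as seen above.

```python
import re

# Matches unittest failure headers such as
# "ERROR: test_name (__main__.SomeTestClass)".
FAILURE_RE = re.compile(r"^(ERROR|FAIL): (\S+) \(([^)]+)\)\s*$", re.MULTILINE)


def failed_tests(log_text):
    """Return (kind, test_name, test_class) tuples for each failure header."""
    return [match.groups() for match in FAILURE_RE.finditer(log_text)]


# Excerpt in the same shape as the log output quoted above.
sample = """\
======================================================================
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
RuntimeError: Processes 0 1 2 exited with error code 10
"""

if __name__ == "__main__":
    for kind, name, cls in failed_tests(sample):
        print(kind, name, cls)
```

To use it on a real log, read the file contents into a string and pass that to failed_tests in place of sample.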


Question: Does anyone know how to fix these errors?

Thanks,
Ole




--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
