Re: [easybuild] Build failure with PyTorch-1.7.1-fosscuda-2020b.eb

2021-06-02 Thread Ole Holm Nielsen

Hi Kenneth,

I can confirm that with EasyBuild v4.4.0 the PyTorch 1.8.1 installation 
went smoothly and without any problems:


$ eb PyTorch-1.8.1-foss-2020b.eb -r

Best regards,
Ole

On 6/1/21 5:13 PM, Kenneth Hoste wrote:

Hi Ole,

This error doesn't mean anything in particular to me, but perhaps it 
rings a bell for Alexander (in CC).


There are a couple of fixes related to PyTorch that will be included in 
the upcoming EasyBuild v4.4.0 release (which will be released tomorrow 
hopefully), so keep an eye out for that...



regards,

Kenneth


On 01/06/2021 09:56, Ole Holm Nielsen wrote:

Dear EasyBuilders,

I'm trying to build PyTorch-1.7.1-fosscuda-2020b.eb on a CentOS 7 server 
with some Nvidia GPUs, and the build fails in the tests after about 2 
hours:


$ eb PyTorch-1.7.1-fosscuda-2020b.eb -r
== Temporary log file in case of crash /tmp/eb-zAAAvr/easybuild-TDNRVQ.log
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== resolving dependencies ...
== processing EasyBuild easyconfig /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb
== building and installing PyTorch/1.7.1-fosscuda-2020b...
== fetching files...
== creating build dir, resetting environment...
== unpacking...
== patching...
== preparing...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/PyTorch/1.7.1/fosscuda-2020b): build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou (took 1 hour 59 min 46 sec)
== Results of the build can be found in the log file(s) /tmp/eb-zAAAvr/easybuild-PyTorch-1.7.1-20210601.074610.WfkGf.log
ERROR: Build of /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou')



The EB log file shows these 4 errors at the end of the file:

==
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
--
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

==
ERROR: test_DistributedDataParallel_SyncBatchNorm (__main__.TestDistBackendWithFork)
--
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

==
ERROR: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackendWithFork)
--
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
   File 
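
When a test log is this long, it can help to list just the failing test names before reading the individual tracebacks. The following is a minimal sketch, not part of EasyBuild or PyTorch: the function name `failed_tests` and the sample text are made up for illustration, and it assumes each `ERROR:`/`FAIL:` header sits on a single line, as it would in the log file itself rather than in a re-wrapped email.

```python
import re

# Matches unittest result headers such as:
#   ERROR: test_name (__main__.TestClass)
# Assumption: the header is on one line (true for the raw log, not for wrapped mail).
HEADER = re.compile(r"^(ERROR|FAIL): (\S+) \((\S+)\)[ \t]*$", re.MULTILINE)


def failed_tests(log_text):
    """Return (kind, test_name, test_class) tuples for every header found in log_text."""
    return [match.groups() for match in HEADER.finditer(log_text)]


# Shortened sample modelled on the errors quoted in this thread.
sample = """\
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
Traceback (most recent call last):
RuntimeError: Processes 0 1 2 exited with error code 10
ERROR: test_DistributedDataParallel_SyncBatchNorm (__main__.TestDistBackendWithFork)
"""

for kind, name, cls in failed_tests(sample):
    print(f"{kind}: {cls}.{name}")
```

On a real build, the text passed to `failed_tests` would be the contents of the log file named in the "Results of the build can be found in the log file(s)" line.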

Re: [easybuild] Troubles with PySCF

2021-06-02 Thread Agustín Aucar
Dear George,

Thanks for your response. A few days ago, I tried compiling the code on a
slave node, but that didn't solve the problem...

Best,
Agustín

On Wed, 2 Jun 2021 at 11:41, George Tsouloupas wrote:

> Hi,
>
> In a similar situation we ended up just building the software on the
> "older" cpu (i.e. the "slave" in your case)
>
> G.
>
>
> George Tsouloupas, PhD
> HPC Facility Technical Director
> The Cyprus Institute
> tel: +357 22208688


Re: [easybuild] Troubles with PySCF

2021-06-02 Thread George Tsouloupas

Hi,

In a similar situation we ended up just building the software on the 
"older" cpu (i.e. the "slave" in your case)


G.


George Tsouloupas, PhD
HPC Facility Technical Director
The Cyprus Institute
tel: +357 22208688


[easybuild] Troubles with PySCF

2021-06-02 Thread Agustín Aucar
Dear EasyBuild experts,

Firstly, thank you for your very nice work!

I'm trying to compile PySCF with the following *.eb file:

easyblock = 'CMakeMakeCp'

name = 'PySCF'
version = '2.0.0a'
versionsuffix = '-Python-%(pyver)s'

homepage = 'http://www.pyscf.org'
description = "PySCF is an open-source collection of electronic structure modules powered by Python."

toolchain = {'name': 'foss', 'version': '2020b'}

source_urls = ['https://github.com/pyscf/pyscf/archive/']
sources = ['v%(version)s.tar.gz']
checksums = ['20f4c9faf65436a97f9dfc8099d3c79b988b0a2c5374c701fbe35abc6fad4922']

builddependencies = [('CMake', '3.18.4')]

dependencies = [
    ('Python', '3.8.6'),
    ('SciPy-bundle', '2020.11'),  # for numpy, scipy
    ('h5py', '3.1.0'),
    ('qcint', '4.0.6', versionsuffix),
    ('libxc', '5.1.3'),
    ('XCFun', '2.1.1'),
]

start_dir = 'pyscf/lib'

separate_build_dir = True

configopts = "-DBUILD_LIBCINT=OFF -DBUILD_LIBXC=OFF -DBUILD_XCFUN=OFF "

prebuildopts = "export PYSCF_INC_DIR=$EBROOTQCINT/include:$EBROOTLIBXC/lib && "

files_to_copy = ['pyscf']

sanity_check_paths = {
    'files': ['pyscf/__init__.py'],
    'dirs': ['pyscf/data', 'pyscf/lib'],
}

sanity_check_commands = ["python -c 'import pyscf'"]

modextrapaths = {'PYTHONPATH': '', 'PYSCF_EXT_PATH': ''}

moduleclass = 'chem'


Although the module is created, I am having trouble running it on any node
other than the master. When I load the module and run the code on the
master node, everything works:

module load chem/PySCF/2.0.0a-foss-2020b-Python-3.8.6
python
from pyscf import gto, scf
mol = gto.M(atom='H 0 0 0; H 0 0 1')
mf = scf.RHF(mol).run()

but when I try to run it on a node different from the master, I get:

Python 3.8.6 (default, Jun  1 2021, 16:43:49)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyscf import gto, scf
>>> mol = gto.M(atom='H 0 0 0; H 0 0 1')
>>> mf = scf.RHF(mol).run()
Illegal instruction (core dumped)

From what I have read in various places, this seems to be related to the
different CPU architectures of our master and slave nodes.

If I execute

grep flags -m1 /proc/cpuinfo | cut -d ":" -f 2 | tr '[:upper:]' '[:lower:]' | {
    read FLAGS; OPT="-march=native"
    for flag in $FLAGS; do
        case "$flag" in
            "sse4_1" | "sse4_2" | "ssse3" | "fma" | "cx16" | "popcnt" | "avx" | "avx2") OPT+=" -m$flag";;
        esac
    done
    MODOPT=${OPT//_/\.}; echo "$MODOPT"; }

on the slaves I get: -march=native -mssse3 -mfma -mcx16 -msse4.1 -msse4.2
-mpopcnt -mavx -mavx2

whereas on the master node we have: -march=native -mcx16
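
An "Illegal instruction" crash typically means the binary contains an instruction that the CPU running it does not implement, so it can be useful to see exactly which feature flags one node has and the other lacks. The following is a minimal sketch, assuming the x86 Linux layout of /proc/cpuinfo with a "flags" line. The function names `cpu_flags` and `missing_on` and the two sample excerpts are hypothetical and do not come from the cluster described here.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on(build_host_text, run_host_text):
    """Flags present on the build host but absent on the host that runs the code."""
    return sorted(cpu_flags(build_host_text) - cpu_flags(run_host_text))


# Hypothetical, heavily shortened cpuinfo excerpts for two nodes.
node_a = "processor\t: 0\nflags\t\t: fpu sse2 cx16 ssse3 avx avx2 fma\n"
node_b = "processor\t: 0\nflags\t\t: fpu sse2 cx16\n"

# Flags that code tuned for node_a might rely on but node_b does not provide.
print(missing_on(node_a, node_b))
```

On real systems the two arguments would be the contents of /proc/cpuinfo read on each node, for example with `open("/proc/cpuinfo").read()`. An empty result in one direction suggests, but does not prove, that code built on the first node can run on the second, since libraries loaded at run time may have been built elsewhere.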

I tried to compile PySCF by adding these lines to my *.eb file:

configopts += "-DBUILD_FLAGS='-march=native -mssse3 -mfma -mcx16 -msse4.1 -msse4.2 -mpopcnt -mavx -mavx2' "
configopts += "-DCMAKE_C_FLAGS='-march=native -mssse3 -mfma -mcx16 -msse4.1 -msse4.2 -mpopcnt -mavx -mavx2' "
configopts += "-DCMAKE_CXX_FLAGS='-march=native -mssse3 -mfma -mcx16 -msse4.1 -msse4.2 -mpopcnt -mavx -mavx2' "
configopts += "-DCMAKE_FORTRAN_FLAGS='-march=native -mssse3 -mfma -mcx16 -msse4.1 -msse4.2 -mpopcnt -mavx -mavx2'"

but in that case the code runs on neither the master nor the slaves.


I'm sorry if it is a stupid question. I am far from being a system admin...

Thanks a lot for your help.

Dr. Agustín Aucar
Institute for Modeling and Innovative Technologies - Argentina


[easybuild] [ANN] EasyBuild v4.4.0

2021-06-02 Thread Kenneth Hoste

Dear EasyBuilders,

We're pleased to announce the release of EasyBuild v4.4.0 [1].

In my view, and I don't say this lightly, this is the best EasyBuild 
release ever!

Not only because the version number has been increased, but also because:

- it includes initial support for custom toolchains using the Fujitsu 
compiler & libraries, which are leveraged on Fugaku, the current fastest 
(publicly known) supercomputer!


- easyconfig files for the 2021a update of the common toolchains (foss + 
intel) are included;


- during the process of preparing this release, we merged the 10,000th 
pull request in the easyconfigs GitHub repository!


- various small yet useful enhancements have been made (more info 
below);


- much to our surprise, we found a couple of bugs in previous EasyBuild 
versions, which have been fixed (more info below);



GitHub Actions being flaky or even down for several hours multiple times 
in recent weeks, including today during the release process, didn't 
prevent us from pushing out this release. Nice try Microsoft!



EasyBuild v4.4.0 is primarily a feature release, but also includes 
several bug fixes & minor enhancements.


Highlights for this release are listed below. More details are available 
in the release notes [2], which include links to the respective pull 
requests for more detailed information.


(this information is also available at 
https://github.com/easybuilders/easybuild/releases/tag/easybuild-v4.4.0)



## Highlighted enhancements

[enhancements that (may) warrant updating existing installations are 
marked with (***)]


- performance improvements for easyconfig parsing and eb startup;

- support for downloading easyconfigs from multiple PRs with --from-pr;

- support for re-running the sanity check for existing installations 
(without making modifications to the installation), via "eb 
--sanity-check-only";


- toolchain definition for Fujitsu toolchain for use on Fugaku;

- allow checking whether specific libraries are (not) linked into 
installed binaries/libraries in sanity check (see various ways to 
specify banned/required libraries);


- update_build_option function to update specific build options after 
initializing the EasyBuild configuration;


- run post-install commands specified for a specific extension;

- add support for skipping the installation of extensions via "eb 
--skip-extensions";


- software-specific easyblocks for FlexiBLAS and dm-reverb;

- custom easyblock to install OpenSSL wrapper for OpenSSL installed in 
OS, with fallback to build and install OpenSSL from source if not 
available in OS;


- enable sanity_pip_check by default for Python easyconfigs if pip >= 
9.0 will be installed;


- add IceLake detection to OpenBLAS 0.3.12 and 0.3.15;



## Prominent bug fixes & changes

[bug fixes or changes that (may) warrant reinstalling easyconfigs are 
marked with (***)]


- re-enable write permissions when installing with read-only-installdir;

- avoid metadata greedy behavior when probing for external module 
metadata (mostly relevant for integration with Cray Programming 
Environment);


- also run sanity check for extensions when using --module-only (you can 
use --skip-extensions to skip sanity checking of extensions when using 
--module-only);


- fix use of --module-only on existing installations without write 
permissions;


- use unload/load in ModuleGeneratorLua.swap_module, since swap is not 
supported by Lmod;


- update HierarchicalMNS to also return ‘Toolchain//’ as 
$MODULEPATH extension for cpe* Cray toolchains;


- enhance sched_getaffinity function to avoid early crash when counting 
available cores on systems with more than 1024 cores;


- (***) enhance Python easyblock: add option to install pip with core 
Python, tweak defaults, create unversioned pip symlink;


- make custom easyblocks for GROMACS and Tkinter work with --module-only;

- make sure that self.python_cmd is set before using it in 
PythonPackage.sanity_check_step to make PythonBundle easyblock 
compatible with --module-only;


- (***) add patch to fix GCC 10.2.0 rejecting valid code on PPC;

- (***) update easyconfigs for binutils 2.35 to use binutils 2.35.2 
source tarball instead to pick up bug fixes;


- (***) fix test failure in TensorFlow 2.4.1 on recent CUDA drivers;

- (***) add patch to fix hardcoded num_cores in DMCfun extension 
included with R 4.0.x;


- (***) fix typo in Delly easyconfig to actually do parallel build;

- (***) fix potential memory leak in OpenBLAS 0.3.12;

- (***) add patches for PyTorch 1.7.1 avoiding failures on POWER and A100;

- fix source URLs for recent Boost and Boost.Python versions;


## Other changes

- tweak foss toolchain definition to switch from OpenBLAS to FlexiBLAS 
in foss/2021a;


- don’t skip sanity check for --module-only --rebuild (sanity check is 
still skipped with --module-only --force);


- consistently use pip to install Python packages in recent Python 
easyconfigs;


- deprecate adding a