On 5/27/21 10:46 AM, Alexander Grund wrote:
/home/modules/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: fatal error:
bazel-out/k8-opt/bin/tensorflow/core/common_runtime/graph_constructor_test:
No space left on device
What device might that be? As shown above, I have quite a bit of disk
space. Is /tmp being used and getting full?
> export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build
> tmpfs 19G 19G 30M 100% /run/user/983
This clearly shows that your buildpath is full, so that is the issue. Try
using another buildpath. Kenneth is right: we make sure Bazel doesn't use
/tmp.
I have found out that the size of /run/user/$UID defaults to 10% of the
system RAM, as set by RuntimeDirectorySize= in /etc/systemd/logind.conf
(see man 5 logind.conf). This 10% value is 19 GB on my server. It seems
prudent to use /dev/shm instead:
export EASYBUILD_BUILDPATH=/dev/shm
While building TensorFlow, the usage of /dev/shm grows to a gigantic size:
# df -Ph /dev/shm
Filesystem Size Used Avail Use% Mounted on
tmpfs 94G 46G 48G 50% /dev/shm
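
Since a full buildpath was the cause of the "No space left on device"
error, it may help to check the available space before choosing one. The
following is only a minimal sketch: the function name pick_buildpath, the
60 GiB threshold and the /scratch/$USER fallback directory are my own
assumptions, not anything EasyBuild provides. It relies only on the POSIX
output format of "df -Pk".

```shell
# Print the first existing directory with at least MIN_KB KiB available.
# Usage: pick_buildpath MIN_KB DIR...
# (pick_buildpath is a hypothetical helper, not part of EasyBuild.)
pick_buildpath() {
    min_kb=$1
    shift
    for dir in "$@"; do
        # Skip candidates that do not exist on this host.
        [ -d "$dir" ] || continue
        # df -P gives stable POSIX columns; field 4 of line 2 is available KiB.
        avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
        if [ "${avail_kb:-0}" -ge "$min_kb" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    # No candidate had enough free space.
    return 1
}
```

It could be used like this, where the threshold is a guess based on the
46G seen above and should be adjusted:

    EASYBUILD_BUILDPATH=$(pick_buildpath $((60 * 1024 * 1024)) /dev/shm /scratch/$USER) && export EASYBUILD_BUILDPATH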
Unfortunately, the build still fails, and I need to look for the source of
the errors in the log file:
== installing extension TensorFlow 2.4.1 (28/28)...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory:
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 chars):
At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
(took 55 min 27 sec)
== Results of the build can be found in the log file(s)
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
ERROR: Build of
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb
failed (err: 'build failed (first 300 chars): At least 2 gpu tests
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')
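
To collect the test labels from a message like the one above, so they can
be searched for in the log file, something along these lines might work.
It is a rough sketch: list_test_labels is a hypothetical helper name, and
the pattern simply matches anything shaped like a Bazel label
(//package:target) on standard input, so it lists every label it sees, not
only the failed ones.

```shell
# Read text on stdin and print the unique Bazel-style labels (//pkg:target).
# (list_test_labels is a hypothetical helper; it does not know which tests
# failed, it only extracts label-shaped strings from its input.)
list_test_labels() {
    grep -o '//[A-Za-z0-9_./-]*:[A-Za-z0-9_.-]*' | sort -u
}
```

For example, piping only the lines around the error message into it:

    grep -A2 'gpu tests failed' /path/to/easybuild.log | list_test_labels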
/Ole