Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-31 Thread Ole Holm Nielsen
For the record:  I managed to build TensorFlow-2.4.1-fosscuda-2020b.eb 
using this PR:

https://github.com/easybuilders/easybuild-easyconfigs/pull/12979

/Ole

On 5/27/21 1:06 PM, Alexander Grund wrote:
Yes: At the very bottom of the log there should be more information about the
failed tests. For each of those (2) tests there should be some more
detailed output.


Search for "At least 2 gpu tests failed" and look below.

FYI: Setting EASYBUILD_TMPDIR to a large directory is not required. 
Temporary files are usually small.


Am 27.05.21 um 13:02 schrieb Ole Holm Nielsen:

On 5/27/21 10:46 AM, Alexander Grund wrote:
 > Alexandre: should we look for patterns like "No space left on 
device" in the Bazel output and highlight them better, perhaps with a 
concrete suggestion to use --tmpdir to avoid the usage of /tmp?


We could in general put something into EasyBuild, yes. I started a PR 
with enhanced error parsing which could maybe be used for that.


I've configured some larger temporary file spaces:
EASYBUILD_TMPDIR=/scratch/modules  (800+ GB available)
EASYBUILD_BUILDPATH=/dev/shm   (94 GB size)

and try to build TensorFlow:

$ eb TensorFlow-2.4.1-fosscuda-2020b.eb 
--cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules


== installing extension TensorFlow 2.4.1 (28/28)...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: 
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 
chars): At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(took 55 min 27 sec)
== Results of the build can be found in the log file(s) 
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log 

ERROR: Build of 
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb 
failed (err: 'build failed (first 300 chars): At least 2 gpu tests 
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')


...

Is there anything else I should look for in the logfile (size: 234 MB)?


Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Alexander Grund


If you would help by analyzing the logfile, I can gzip it and send you
a URL?



Might have been mentioned above, but please do so. That might be easiest.





Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Alexander Grund
Yes: At the very bottom of the log there should be more information about
the failed tests. For each of those (2) tests there should be some more
detailed output.


Search for "At least 2 gpu tests failed" and look below.
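With a log of a few hundred MB, opening it in an editor is painful. A minimal sketch of the "search and look below" step follows; the helper name `lines_after_marker` and the demo file are made up for illustration, and it assumes the details of interest follow the first line that contains the marker text:

```python
import os
import tempfile


def lines_after_marker(path, marker, max_lines=40):
    """Return the first line containing `marker` plus the lines that follow it.

    The file is streamed line by line and reading stops as soon as
    `max_lines` lines have been collected, so a large log is never
    loaded into memory as a whole.
    """
    captured = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if captured or marker in line:
                captured.append(line.rstrip("\n"))
                if len(captured) >= max_lines:
                    break
    return captured


# Demo on a tiny stand-in file (hypothetical content, not a real EasyBuild log).
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as demo:
    demo.write("== testing...\n")
    demo.write("At least 2 gpu tests failed:\n")
    demo.write("//tensorflow/core/common_runtime/gpu:gpu_device_test\n")

print("\n".join(lines_after_marker(demo.name, "gpu tests failed")))
os.remove(demo.name)
```

Pointing `path` at the real EasyBuild log and raising `max_lines` should give a readable excerpt without paging through the whole file.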

FYI: Setting EASYBUILD_TMPDIR to a large directory is not required. 
Temporary files are usually small.


Am 27.05.21 um 13:02 schrieb Ole Holm Nielsen:

On 5/27/21 10:46 AM, Alexander Grund wrote:
 > Alexandre: should we look for patterns like "No space left on 
device" in the Bazel output and highlight them better, perhaps with a 
concrete suggestion to use --tmpdir to avoid the usage of /tmp?


We could in general put something into EasyBuild, yes. I started a PR 
with enhanced error parsing which could maybe be used for that.


I've configured some larger temporary file spaces:
EASYBUILD_TMPDIR=/scratch/modules  (800+ GB available)
EASYBUILD_BUILDPATH=/dev/shm   (94 GB size)

and try to build TensorFlow:

$ eb TensorFlow-2.4.1-fosscuda-2020b.eb 
--cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules


== installing extension TensorFlow 2.4.1 (28/28)...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: 
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 
chars): At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(took 55 min 27 sec)
== Results of the build can be found in the log file(s) 
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
ERROR: Build of 
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb 
failed (err: 'build failed (first 300 chars): At least 2 gpu tests 
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')


...

Is there anything else I should look for in the logfile (size: 234 MB)?






Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Ole Holm Nielsen

On 5/27/21 1:06 PM, Alexander Grund wrote:
Yes: At the very bottom of the log there should be more information about the
failed tests. For each of those (2) tests there should be some more
detailed output.


Search for "At least 2 gpu tests failed" and look below.


This is at the very end of the logfile:

[--] Global test environment tear-down
[==] 19 tests from 2 test suites ran. (2972 ms total)
[  PASSED  ] 18 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority

 1 FAILED TEST

== 2021-05-27 12:35:39,386 build_log.py:169 ERROR EasyBuild crashed with 
an error (at easybuild/base/exceptions.py:124 in __init__): At least 2 gpu 
tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(at easybuild/easyblocks/t/tensorflow.py:973 in test_step)
== 2021-05-27 12:35:39,386 filetools.py:1810 INFO Removing lock 
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock...
== 2021-05-27 12:35:39,387 filetools.py:347 INFO Path 
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock 
successfully removed.
== 2021-05-27 12:35:39,388 filetools.py:1814 INFO Lock removed: 
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock
== 2021-05-27 12:35:39,388 easyblock.py:3414 WARNING build failed (first 
300 chars): At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
== 2021-05-27 12:35:39,388 easyblock.py:298 INFO Closing log for 
application name TensorFlow version 2.4.1



If you would help by analyzing the logfile, I can gzip it and send you a URL?

Thanks,
Ole


FYI: Setting EASYBUILD_TMPDIR to a large directory is not required. 
Temporary files are usually small.


Am 27.05.21 um 13:02 schrieb Ole Holm Nielsen:

On 5/27/21 10:46 AM, Alexander Grund wrote:
 > Alexandre: should we look for patterns like "No space left on 
device" in the Bazel output and highlight them better, perhaps with a 
concrete suggestion to use --tmpdir to avoid the usage of /tmp?


We could in general put something into EasyBuild, yes. I started a PR 
with enhanced error parsing which could maybe be used for that.


I've configured some larger temporary file spaces:
EASYBUILD_TMPDIR=/scratch/modules  (800+ GB available)
EASYBUILD_BUILDPATH=/dev/shm   (94 GB size)

and try to build TensorFlow:

$ eb TensorFlow-2.4.1-fosscuda-2020b.eb 
--cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules


== installing extension TensorFlow 2.4.1 (28/28)...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: 
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 
chars): At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(took 55 min 27 sec)
== Results of the build can be found in the log file(s) 
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log 

ERROR: Build of 
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb 
failed (err: 'build failed (first 300 chars): At least 2 gpu tests 
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')


...

Is there anything else I should look for in the logfile (size: 234 MB)?




Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Ole Holm Nielsen

On 5/27/21 10:46 AM, Alexander Grund wrote:
 > Alexandre: should we look for patterns like "No space left on device" 
in the Bazel output and highlight them better, perhaps with a concrete 
suggestion to use --tmpdir to avoid the usage of /tmp?


We could in general put something into EasyBuild, yes. I started a PR with 
enhanced error parsing which could maybe be used for that.


I've configured some larger temporary file spaces:
EASYBUILD_TMPDIR=/scratch/modules  (800+ GB available)
EASYBUILD_BUILDPATH=/dev/shm   (94 GB size)

and then tried to build TensorFlow:

$ eb TensorFlow-2.4.1-fosscuda-2020b.eb 
--cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules


== installing extension TensorFlow 2.4.1 (28/28)...
==  configuring...
==  building...
==  testing...
== FAILED: Installation ended unsuccessfully (build directory: 
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 chars): 
At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(took 55 min 27 sec)
== Results of the build can be found in the log file(s) 
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
ERROR: Build of 
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb 
failed (err: 'build failed (first 300 chars): At least 2 gpu tests 
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')


In the logfile I see multiple FAILED tests:

$ grep FAILED 
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log

FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
FAILED: 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(Summary)

[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
//tensorflow/core/common_runtime/gpu:gpu_device_test 
FAILED in 3 out of 3 in 4.8s
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
FAILED in 3 out of 3 in 3.5s
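The grep output repeats the same test case many times because Bazel retried each target three times. A small sketch that condenses such output into unique Bazel targets and gtest cases is shown here; the function name `summarize_failures` and the two regular expressions are my own assumptions, modelled only on the lines quoted above, and they expect each log entry on a single line (not wrapped as in this mail):

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread (an assumption,
# not an official description of the Bazel or googletest output format).
GTEST_FAILED = re.compile(r"^\[\s+FAILED\s+\]\s+(\w+\.\w+)")
BAZEL_FAILED = re.compile(r"^FAILED: (//\S+) \(Summary\)")


def summarize_failures(lines):
    """Return (targets, tests): failed Bazel targets in order of appearance,
    and a Counter of how often each gtest case was reported as failed."""
    targets = []
    tests = Counter()
    for line in lines:
        gtest = GTEST_FAILED.match(line)
        if gtest:
            tests[gtest.group(1)] += 1
            continue
        bazel = BAZEL_FAILED.match(line)
        if bazel and bazel.group(1) not in targets:
            targets.append(bazel.group(1))
    return targets, tests


sample = [
    "FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)",
    "[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)",
    "[  FAILED  ] 1 test, listed below:",
    "[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority",
    " 1 FAILED TEST",
]

targets, tests = summarize_failures(sample)
print("targets:", targets)
print("test cases:", dict(tests))
```

On the output above this would boil everything down to two targets and a single failing test case, which makes it easier to see that one test is responsible for both failures.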

FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 
ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323 
ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 
ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
	FAILED: 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(Summary)

[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40 
ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158 
ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 
ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
	//tensorflow/core/common_runtime/gpu:gpu_device_test 
FAILED in 3 out of 3 in 4.8s
	//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
FAILED in 3 out of 3 in 3.5s)

FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[  FAILED  ] 

Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Ole Holm Nielsen

On 5/27/21 10:46 AM, Alexander Grund wrote:
/home/modules/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: fatal 
error: 
bazel-out/k8-opt/bin/tensorflow/core/common_runtime/graph_constructor_test: 
No space left on device


What device might that be?  As shown above, I have quite a bit of disk 
space.  Is /tmp being used and getting full?


 > export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build

 > tmpfs   19G   19G   30M 100% /run/user/983

This clearly shows that your buildpath is full. So that is the issue. Try 
using another buildpath, Kenneth is right, we make sure Bazel doesn't use 
/tmp.


I have found out that /run/user/$UID defaults to 10% of the system RAM,
as defined in /etc/systemd/logind.conf (see man 5 logind.conf).
On my server this 10% amounts to 19 GB. It seems prudent to use
/dev/shm instead:


export EASYBUILD_BUILDPATH=/dev/shm

While building TensorFlow the /dev/shm grows to a gigantic size:

# df -Ph /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            94G   46G   48G  50% /dev/shm
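To avoid finding out about a full build path only after 40+ minutes, the free space can be checked up front. This is only a sketch: the helper names are invented, and the 50 GiB threshold is a guess derived from the 46G usage shown by df, not a documented requirement of the TensorFlow build.

```python
import shutil
import tempfile

GIB = 2**30


def free_gib(path):
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def has_room(path, required_gib):
    """True if at least `required_gib` GiB are free under `path`."""
    return free_gib(path) >= required_gib


# Assumption: 50 GiB is a guess based on the 46G seen in the df output,
# not an official figure. Replace `candidate` with the actual build path,
# for example the value of EASYBUILD_BUILDPATH.
candidate = tempfile.gettempdir()
required = 50
status = "ok" if has_room(candidate, required) else "too small"
print(f"{candidate}: {free_gib(candidate):.1f} GiB free, need {required} GiB -> {status}")
```

As a sanity check on the numbers in this thread: 19 GB at 10% of RAM suggests roughly 190 GB of memory, and the 94 GB size of /dev/shm would fit the common tmpfs default of half the RAM.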

Unfortunately, the build still fails and I need to look for the source of 
errors in the logfile:


== installing extension TensorFlow 2.4.1 (28/28)...
==  configuring...
==  building...
==  testing...
== FAILED: Installation ended unsuccessfully (build directory: 
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 chars): 
At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu 
(took 55 min 27 sec)
== Results of the build can be found in the log file(s) 
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
ERROR: Build of 
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb 
failed (err: 'build failed (first 300 chars): At least 2 gpu tests 
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test, 
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')



/Ole


Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Strube, Alexandre
> 
> > export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build
> 
> > tmpfs   19G   19G   30M 100% /run/user/983
> 
> This clearly shows that your buildpath is full. So that is the issue. Try 
> using another buildpath, Kenneth is right, we make sure Bazel doesn't use 
> /tmp.
> 
>> 
>>> I'd also suggest to join Slack as discussions there are potentially faster.
>> 
>> I'll take a look - are there instructions for Slack?
> 
> https://easybuild-slack.herokuapp.com/
> 
> > Alexandre: should we look for patterns like "No space left on device" in 
> > the Bazel output and highlight them better, perhaps with a concrete 
> > suggestion to use --tmpdir to avoid the usage of /tmp?
> 
> We could in general put something into EasyBuild, yes. I started a PR with 
> enhanced error parsing which could maybe be used for that.

Since my name was mentioned (by mistake, but I'm happy), and since I have also spent an
inordinate amount of time on TensorFlow, here is my experience:

Its build ALWAYS fills up our /tmp. So yes, use a separate location.

At JSC, we have some patches for tensorflow 2.3.1 which might be slightly 
different from Alexander’s (which I would call official): 
https://github.com/easybuilders/JSC/tree/2020/Golden_Repo/t/TensorFlow 


And we set the compute capabilities directly in the easyconfig.



Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Ole Holm Nielsen

Hi Loris,

On 5/27/21 10:34 AM, Loris Bennett wrote:

What device might that be?  As shown above, I have quite a bit of disk space.
Is /tmp being used and getting full?


This might be the case.  In the past I ran into this problem and solved
it with the following:

   eb TensorFlow-1.15.0-fosscuda-2019b-Python-3.7.4.eb --robot 
--cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm 
--tmpdir=/scratch/eb-build


Yes, I configured that with:

export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build
ulimit -s 2000240
export EASYBUILD_TMPDIR=/scratch/$USER

Thanks,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620


Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Alexander Grund



Please note these two errors:

WARNING: Download from
https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/f402e682d0ef5598eeffc9a21a691b03e602ff58.tar.gz
failed: class
com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException
GET returned 404 Not Found


Is the URL outdated?

That's ok. TF has a fallback


/home/modules/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: 
fatal error: 
bazel-out/k8-opt/bin/tensorflow/core/common_runtime/graph_constructor_test: 
No space left on device


What device might that be?  As shown above, I have quite a bit of disk 
space.  Is /tmp being used and getting full?


> export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build

> tmpfs   19G   19G   30M 100% /run/user/983

This clearly shows that your buildpath is full. So that is the issue. 
Try using another buildpath, Kenneth is right, we make sure Bazel 
doesn't use /tmp.




I'd also suggest to join Slack as discussions there are potentially 
faster.


I'll take a look - are there instructions for Slack?


https://easybuild-slack.herokuapp.com/

> Alexandre: should we look for patterns like "No space left on device" 
in the Bazel output and highlight them better, perhaps with a concrete 
suggestion to use --tmpdir to avoid the usage of /tmp?


We could in general put something into EasyBuild, yes. I started a PR 
with enhanced error parsing which could maybe be used for that.








Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Kenneth Hoste

(please keep Alexander in the loop)

On 27/05/2021 10:34, Loris Bennett wrote:

Ole Holm Nielsen  writes:


On 5/27/21 9:48 AM, Alexander Grund wrote:



The EB log file reports an error:

//tensorflow/core/common_runtime:graph_constructor_test FAILED TO BUILD

and the log file ends with:

Executed 137 out of 814 tests: 137 tests pass, 1 fails to build and 676 were
skipped.
FAILED: Build did NOT complete successfully

This is a build failure, so something we should fix or at least find the
cause.
Please check the log, there should be something about why/how it failed to
compile. Just search for the name and scroll a bit around. If you attach it, I
can also take a look.


The EB log file is 205 MB, so it's hard to share :-(

I have this environment:

export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build
ulimit -s 2000240
export EASYBUILD_TMPDIR=/scratch/$USER

and there is quite a bit of space available:

$ df -h /run/user/$UID/eb_build /scratch
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                               19G   19G   30M 100% /run/user/983
/dev/mapper/VolGroup00-lv_scratch  850G  675M  849G   1% /scratch

...


/home/modules/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: fatal error: 
bazel-out/k8-opt/bin/tensorflow/core/common_runtime/graph_constructor_test: No 
space left on device


What device might that be?  As shown above, I have quite a bit of disk space.
Is /tmp being used and getting full?


This might be the case.  In the past I ran into this problem and solved
it with the following:

   eb TensorFlow-1.15.0-fosscuda-2019b-Python-3.7.4.eb --robot 
--cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm 
--tmpdir=/scratch/eb-build


Hmm, this surprises me a bit, because I think we make an effort to keep
Bazel from using /tmp for too many things, and we tell it to use the
build directory instead...


Please try using --tmpdir to specify a directory other than /tmp,
and see if that helps at all.


Alexandre: should we look for patterns like "No space left on device" in 
the Bazel output and highlight them better, perhaps with a concrete 
suggestion to use --tmpdir to avoid the usage of /tmp?
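To make the idea concrete, here is a rough sketch of such pattern matching. It is not EasyBuild code: the table `KNOWN_PROBLEMS`, the function `find_known_problems` and the hint text are hypothetical, and a real implementation inside the easyblock would likely look different.

```python
import re

# Hypothetical pattern table (not part of EasyBuild): regex -> hint for the user.
KNOWN_PROBLEMS = [
    (
        re.compile(r"No space left on device"),
        "a filesystem filled up; check the build path and try --buildpath or --tmpdir",
    ),
]


def find_known_problems(lines):
    """Return (line_number, text, hint) for every line matching a known pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern, hint in KNOWN_PROBLEMS:
            if pattern.search(line):
                hits.append((number, line.strip(), hint))
    return hits


sample_output = [
    "== building...",
    "ld.gold: fatal error: bazel-out/k8-opt/bin/tensorflow/core/common_runtime/"
    "graph_constructor_test: No space left on device",
    "FAILED: Build did NOT complete successfully",
]

for number, text, hint in find_known_problems(sample_output):
    print(f"line {number}: {text}\n  hint: {hint}")
```

Keeping the patterns in a plain table would make it cheap to add further well-known failure messages later.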



regards,

Kenneth




YMMV

Cheers,

Loris


I'd also suggest to join Slack as discussions there are potentially faster.


I'll take a look - are there instructions for Slack?

Thanks,
Ole


Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Loris Bennett
Ole Holm Nielsen  writes:

> On 5/27/21 9:48 AM, Alexander Grund wrote:
>>
 The EB log file reports an error:

 //tensorflow/core/common_runtime:graph_constructor_test FAILED TO BUILD

 and the log file ends with:

 Executed 137 out of 814 tests: 137 tests pass, 1 fails to build and 676 
 were
 skipped.
 FAILED: Build did NOT complete successfully
>> This is a build failure, so something we should fix or at least find the
>> cause.
>> Please check the log, there should be something about why/how it failed to
>> compile. Just search for the name and scroll a bit around. If you attach it, 
>> I
>> can also take a look.
>
> The EB log file is 205 MB, so it's hard to share :-(
>
> I have this environment:
>
> export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build
> ulimit -s 2000240
> export EASYBUILD_TMPDIR=/scratch/$USER
>
> and there is quite a bit of space available:
>
> $ df -h /run/user/$UID/eb_build /scratch
> Filesystem                         Size  Used Avail Use% Mounted on
> tmpfs                               19G   19G   30M 100% /run/user/983
> /dev/mapper/VolGroup00-lv_scratch  850G  675M  849G   1% /scratch
>
> Searching for FAIL in the log file, I noticed this section:
>
> == 2021-05-26 15:20:28,456 tensorflow.py:899 INFO Starting cpu test
> == 2021-05-26 15:20:28,457 run.py:225 INFO running cmd:  bazel
> --output_user_root=/run/user/983/eb_build/TensorFlow/2.4.1/fosscuda-2020b/tmpkYJDaH-bazel-tf
> --host_jvm_args=-Xms512m --host_jvm_args=-Xmx4096m test --config=noaws
> --config=nogcp --config=nohd
> fs --compilation_mode=opt --config=opt --subcommands --verbose_failures
> --jobs=64 --copt="-fPIC"
> --action_env=CPATH='/home/modules/software/cURL/7.72.0-GCCcore-10.2.0/include:/home/modules/software/double-conversion/3.1.5-GCCcore-10.2.0/include:/home/modu
> les/software/flatbuffers/1.12.0-GCCcore-10.2.0/include:/home/modules/software/giflib/5.2.1-GCCcore-10.2.0/include:/home/modules/software/hwloc/2.2.0-GCCcore-10.2.0/include:/home/modules/software/ICU/67.1-GCCcore-10.2.0/include:/home/modules/software/JsonC
> pp/1.9.4-GCCcore-10.2.0/include:/home/modules/software/libjpeg-turbo/2.0.5-GCCcore-10.2.0/include:/home/modules/software/libpng/1.6.37-GCCcore-10.2.0/include:/home/modules/software/LMDB/0.9.24-GCCcore-10.2.0/include:/home/modules/software/nsync/1.24.0-GCC
> core-10.2.0/include:/home/modules/software/PCRE/8.44-GCCcore-10.2.0/include:/home/modules/software/protobuf/3.14.0-GCCcore-10.2.0/include:/home/modules/software/pybind11/2.6.0-GCCcore-10.2.0/include:/home/modules/software/snappy/1.1.8-GCCcore-10.2.0/inclu
> de:/home/modules/software/SQLite/3.33.0-GCCcore-10.2.0/include:/home/modules/software/zlib/1.2.11-GCCcore-10.2.0/include'
> --action_env=LIBRARY_PATH='/home/modules/software/cURL/7.72.0-GCCcore-10.2.0/lib:/home/modules/software/double-conversion/3.1.5-GCCco
> re-10.2.0/lib:/home/modules/software/flatbuffers/1.12.0-GCCcore-10.2.0/lib:/home/modules/software/giflib/5.2.1-GCCcore-10.2.0/lib:/home/modules/software/hwloc/2.2.0-GCCcore-10.2.0/lib:/home/modules/software/ICU/67.1-GCCcore-10.2.0/lib:/home/modules/softwa
> re/JsonCpp/1.9.4-GCCcore-10.2.0/lib:/home/modules/software/libjpeg-turbo/2.0.5-GCCcore-10.2.0/lib64:/home/modules/software/libpng/1.6.37-GCCcore-10.2.0/lib:/home/modules/software/LMDB/0.9.24-GCCcore-10.2.0/lib:/home/modules/software/nsync/1.24.0-GCCcore-1
> 0.2.0/lib:/home/modules/software/PCRE/8.44-GCCcore-10.2.0/lib:/home/modules/software/protobuf/3.14.0-GCCcore-10.2.0/lib:/home/modules/software/pybind11/2.6.0-GCCcore-10.2.0/lib:/home/modules/software/snappy/1.1.8-GCCcore-10.2.0/lib:/home/modules/software/
> SQLite/3.33.0-GCCcore-10.2.0/lib:/home/modules/software/zlib/1.2.11-GCCcore-10.2.0/lib'
> --action_env=PYTHONPATH --action_env=PYTHONNOUSERSITE=1
> --distinct_host_configuration=false --config=mkl --test_output=errors
> --build_tests_only --local_test_jobs=64 -
> -test_tag_filters='-gpu,-tpu,-no_cuda_on_cpu_tap,-no_pip,-no_oss,-oss_serial,-benchmark-test,-v1only'
> --build_tag_filters='-gpu,-tpu,-no_cuda_on_cpu_tap,-no_pip,-no_oss,-oss_serial,-benchmark-test,-v1only'
> --test_env=CUDA_VISIBLE_DEVICES='-1' --test_timeo
> ut=3600 --test_size_filters=small --
> //tensorflow/core/... -//tensorflow/core:example_java_proto
> -//tensorflow/core/example:example_protos_closure
> //tensorflow/cc/... //tensorflow/c/... //tensorflow/python/... 
> -//tensorflow/core/profiler/internal/gpu:devi
> ce_tracer_test -//tensorflow/c/eager:c_api_test_gpu
> -//tensorflow/c/eager:c_api_distributed_test
> -//tensorflow/c/eager:c_api_distributed_test_gpu
> -//tensorflow/c/eager:c_api_cluster_test_gpu
> -//tensorflow/c/eager:c_api_remote_function_test_gpu -//tensorfl
> ow/c/eager:c_api_remote_test_gpu
> -//tensorflow/core/kernels:sparse_matmul_op_test
> -//tensorflow/core/kernels:sparse_matmul_op_test_gpu
> -//tensorflow/core/common_runtime:collective_param_resolver_local_test
> 

Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Jakob Schiøtz
> 
>> I'd also suggest to join Slack as discussions there are potentially faster.
> 
> I'll take a look - are there instructions for Slack?

One can join here:

https://easybuild-slack.herokuapp.com/

Jakob



Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test

2021-05-27 Thread Ole Holm Nielsen
On 5/27/21 9:48 AM, Alexander Grund wrote:

The EB log file reports an error:

//tensorflow/core/common_runtime:graph_constructor_test FAILED TO BUILD

and the log file ends with:

Executed 137 out of 814 tests: 137 tests pass, 1 fails to build and 676 were skipped.
FAILED: Build did NOT complete successfully
This is a build failure, so something we should fix or at least find the cause. Please check the log, there should be something about why/how it failed to compile. Just search for the name and scroll a bit around. If you attach it, I can also take a look.

The EB log file is 205 MB, so it's hard to share :-(

I have this environment:

export EASYBUILD_BUILDPATH=/run/user/$UID/eb_build
ulimit -s 2000240
export EASYBUILD_TMPDIR=/scratch/$USER

and there is quite a bit of space available:

$ df -h /run/user/$UID/eb_build /scratch
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                               19G   19G   30M 100% /run/user/983
/dev/mapper/VolGroup00-lv_scratch  850G  675M  849G   1% /scratch

Searching for FAIL in the log file, I noticed this section:

== 2021-05-26 15:20:28,456 tensorflow.py:899 INFO Starting cpu test
== 2021-05-26 15:20:28,457 run.py:225 INFO running cmd: bazel --output_user_root=/run/user/983/eb_build/TensorFlow/2.4.1/fosscuda-2020b/tmpkYJDaH-bazel-tf --host_jvm_args=-Xms512m --host_jvm_args=-Xmx4096m test --config=noaws --config=nogcp --config=nohdfs --compilation_mode=opt --config=opt --subcommands --verbose_failures --jobs=64 --copt="-fPIC" --action_env=CPATH='/home/modules/software/cURL/7.72.0-GCCcore-10.2.0/include:/home/modules/software/double-conversion/3.1.5-GCCcore-10.2.0/include:/home/modu
les/software/flatbuffers/1.12.0-GCCcore-10.2.0/include:/home/modules/software/giflib/5.2.1-GCCcore-10.2.0/include:/home/modules/software/hwloc/2.2.0-GCCcore-10.2.0/include:/home/modules/software/ICU/67.1-GCCcore-10.2.0/include:/home/modules/software/JsonC
pp/1.9.4-GCCcore-10.2.0/include:/home/modules/software/libjpeg-turbo/2.0.5-GCCcore-10.2.0/include:/home/modules/software/libpng/1.6.37-GCCcore-10.2.0/include:/home/modules/software/LMDB/0.9.24-GCCcore-10.2.0/include:/home/modules/software/nsync/1.24.0-GCC
core-10.2.0/include:/home/modules/software/PCRE/8.44-GCCcore-10.2.0/include:/home/modules/software/protobuf/3.14.0-GCCcore-10.2.0/include:/home/modules/software/pybind11/2.6.0-GCCcore-10.2.0/include:/home/modules/software/snappy/1.1.8-GCCcore-10.2.0/inclu
de:/home/modules/software/SQLite/3.33.0-GCCcore-10.2.0/include:/home/modules/software/zlib/1.2.11-GCCcore-10.2.0/include' --action_env=LIBRARY_PATH='/home/modules/software/cURL/7.72.0-GCCcore-10.2.0/lib:/home/modules/software/double-conversion/3.1.5-GCCco
re-10.2.0/lib:/home/modules/software/flatbuffers/1.12.0-GCCcore-10.2.0/lib:/home/modules/software/giflib/5.2.1-GCCcore-10.2.0/lib:/home/modules/software/hwloc/2.2.0-GCCcore-10.2.0/lib:/home/modules/software/ICU/67.1-GCCcore-10.2.0/lib:/home/modules/softwa
re/JsonCpp/1.9.4-GCCcore-10.2.0/lib:/home/modules/software/libjpeg-turbo/2.0.5-GCCcore-10.2.0/lib64:/home/modules/software/libpng/1.6.37-GCCcore-10.2.0/lib:/home/modules/software/LMDB/0.9.24-GCCcore-10.2.0/lib:/home/modules/software/nsync/1.24.0-GCCcore-1
0.2.0/lib:/home/modules/software/PCRE/8.44-GCCcore-10.2.0/lib:/home/modules/software/protobuf/3.14.0-GCCcore-10.2.0/lib:/home/modules/software/pybind11/2.6.0-GCCcore-10.2.0/lib:/home/modules/software/snappy/1.1.8-GCCcore-10.2.0/lib:/home/modules/software/
SQLite/3.33.0-GCCcore-10.2.0/lib:/home/modules/software/zlib/1.2.11-GCCcore-10.2.0/lib' --action_env=PYTHONPATH --action_env=PYTHONNOUSERSITE=1 --distinct_host_configuration=false --config=mkl --test_output=errors --build_tests_only --local_test_jobs=64 --test_tag_filters='-gpu,-tpu,-no_cuda_on_cpu_tap,-no_pip,-no_oss,-oss_serial,-benchmark-test,-v1only' --build_tag_filters='-gpu,-tpu,-no_cuda_on_cpu_tap,-no_pip,-no_oss,-oss_serial,-benchmark-test,-v1only' --test_env=CUDA_VISIBLE_DEVICES='-1' --test_timeout=3600 --test_size_filters=small -- //tensorflow/core/... -//tensorflow/core:example_java_proto -//tensorflow/core/example:example_protos_closure //tensorflow/cc/... //tensorflow/c/... //tensorflow/python/...

Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test




The EB log file reports an error:

//tensorflow/core/common_runtime:graph_constructor_test FAILED TO BUILD

and the log file ends with:

Executed 137 out of 814 tests: 137 tests pass, 1 fails to build and 
676 were skipped.

FAILED: Build did NOT complete successfully
This is a build failure, so something we should fix or at least find the 
cause.
Please check the log, there should be something about why/how it failed 
to compile. Just search for the name and scroll a bit around. If you 
attach it, I can also take a look.


I'd also suggest to join Slack as discussions there are potentially faster.

Best,
Alex






Re: [easybuild] TensorFlow build fails in //tensorflow/core/common_runtime:graph_constructor_test


Hi Ole,

The TensorFlow tests are known to be a PITA (to put it mildly).

Alexander (who I included in CC since he's not on the mailing list) has 
spent quite a bit of time weeding things out there, so maybe he has 
specific suggestions.


An easy way out if you're willing to ignore failing tests is to use "eb 
--skip-test-step" to do the installation.


You can also filter out specific tests via the test_targets custom 
easyconfig parameter for TensorFlow, which is already used in 
TensorFlow-2.4.1-foss-2020b.eb .


From the output you shared it's not clear how badly that specific test 
is failing though, you should be able to dig up more information from 
the EasyBuild log file on that...



regards,

Kenneth

On 26/05/2021 16:07, Ole Holm Nielsen wrote:

I'm trying to build TensorFlow with EB 4.3.4 but get an error:

$ eb TensorFlow-2.4.1-fosscuda-2020b.eb 
--cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules


(lines deleted)
== installing extension TensorFlow 2.4.1 (28/28)...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: 
/run/user/983/eb_build/TensorFlow/2.4.1/fosscuda-2020b): build failed 
(first 300 chars): At least 1 cpu tests failed:
//tensorflow/core/common_runtime:graph_constructor_test (took 43 min 58 
sec)
== Results of the build can be found in the log file(s) 
/scratch/modules/eb-KPZu0P/easybuild-TensorFlow-2.4.1-20210526.144651.PuIWy.log 

ERROR: Build of 
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb 
failed (err: 'build failed (first 300 chars): At least 1 cpu tests 
failed:\n//tensorflow/core/common_runtime:graph_constructor_test')



The EB log file reports an error:

//tensorflow/core/common_runtime:graph_constructor_test FAILED 
TO BUILD


and the log file ends with:

Executed 137 out of 814 tests: 137 tests pass, 1 fails to build and 676 
were skipped.

FAILED: Build did NOT complete successfully

== 2021-05-26 15:30:49,719 build_log.py:169 ERROR EasyBuild crashed with 
an error (at easybuild/base/exceptions.py:124 in __init__): At least 1 
cpu tests failed:
//tensorflow/core/common_runtime:graph_constructor_test (at 
easybuild/easyblocks/t/tensorflow.py:973 in test_step)
== 2021-05-26 15:30:49,719 filetools.py:1810 INFO Removing lock 
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock... 

== 2021-05-26 15:30:49,721 filetools.py:347 INFO Path 
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock 
successfully removed.
== 2021-05-26 15:30:49,721 filetools.py:1814 INFO Lock removed: 
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock 

== 2021-05-26 15:30:49,721 easyblock.py:3414 WARNING build failed (first 
300 chars): At least 1 cpu tests failed:

//tensorflow/core/common_runtime:graph_constructor_test
== 2021-05-26 15:30:49,721 easyblock.py:298 INFO Closing log for 
application name TensorFlow version 2.4.1



Can anyone suggest a fix for this issue?

Thanks,
Ole