P.S. So far, for the non-GPU build, I see a 6x speedup over the
Anaconda3/5.1.0 version on an AVX2-based cluster (1 node, 28 CPUs).
Stick with -O2 (the EB default); -O3 ("opt": True) doesn't help.
Preparing to run the same benchmark[*] with the GPU(s) (2x Tesla K80).
[*] https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/README.md
Some other notes:
Don't even try to build TensorFlow on RHEL/CentOS 6: its older
kernel doesn't have MADV_HUGEPAGE support.
For TensorFlow, I customized for our site with:
cuda_compute_capabilities = ['3.5', '3.7']  # for Tesla K20 (ada) and K80 (terra)
I could have left out 3.5, since ada is RHEL 6.
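
As a rough sketch only, this is how the trimmed-down setting might look in an easyconfig (which uses Python syntax) if 3.5 were dropped. The commented-out toolchainopts line is an assumption about where the -O3 switch would go, not a copy of the actual file:

```python
# Sketch of a site-specific easyconfig fragment; not the actual attached file.

# Only the K80 nodes (terra, compute capability 3.7) would be targeted here,
# since the K20 nodes (ada) run RHEL 6.
cuda_compute_capabilities = ['3.7']

# Assumed placement of the -O3 switch; left commented out because it gave
# no measurable speedup over the -O2 default in the CPU benchmark.
# toolchainopts = {'opt': True}
```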
jack
On 03/14/2018 05:48 AM, Jack Perdue wrote:
+1 !!!!!
I struggled with the same issue (I have no idea where Stephane got
his/her copy).
FWIW, attached is what I came up with, which includes
that fix and a cleanup of the duplicate libs.
Jack Perdue
Lead Systems Administrator
High Performance Research Computing
TAMU Division of Research
[email protected] http://hprc.tamu.edu
HPRC Helpdesk: [email protected]
On 03/14/2018 05:35 AM, Joachim Hein wrote:
Hi,
I am trying TensorFlow-1.5.0-goolfc-2017b-Python-3.6.3.eb. It is
looking for a file cudnn-9.0-linux-x64-v7.0.5.15.tgz, but I am
currently getting cudnn-9.0-linux-x64-v7.tgz from the NVIDIA download
site. The sha256 sum of the file I just downloaded agrees with the
one in the EB config. After renaming my download to the name
expected by EB, cuDNN builds.
Can the config be upgraded to handle both the old and the new name? Is
that something EB supports? Otherwise we should leave a comment inside
the config saying that renaming is a workaround (one needs a manual
download of the sources anyway).
Any comments?
Best wishes
Joachim
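
For anyone applying the rename workaround by hand, here is a minimal sketch of the two steps described above: check the sha256 sum of the downloaded tarball, then place a copy under the name the config expects. The helper names sha256_of and stage_source are made up for illustration, and the expected checksum has to be taken from the easyconfig itself.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_source(downloaded, expected_name, expected_sha256):
    """Copy `downloaded` next to itself as `expected_name` if the checksum matches.

    Hypothetical helper: raises ValueError on a checksum mismatch and
    returns the path of the renamed copy otherwise.
    """
    downloaded = Path(downloaded)
    actual = sha256_of(downloaded)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {downloaded.name}: {actual}")
    target = downloaded.with_name(expected_name)
    if not target.exists():
        # Keep the original download and add a copy under the expected name.
        shutil.copy2(downloaded, target)
    return target
```

It would be called as stage_source("cudnn-9.0-linux-x64-v7.tgz", "cudnn-9.0-linux-x64-v7.0.5.15.tgz", checksum), with checksum being the sha256 string from the config. Copying rather than moving keeps the file under the name NVIDIA currently ships, in case a later config expects that one.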