P.S. So far, for the non-GPU build, I'm seeing a 6x speedup over the Anaconda3/5.1.0 version on an AVX2-based cluster (1 node, 28 CPUs). Stick with -O2 (the EB default); -O3 ("opt": True) doesn't help.
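
For anyone who wants to repeat the -O2 vs. -O3 comparison, the -O3 variant mentioned above would look roughly like this in the easyconfig. This is only a sketch: it assumes the option goes in the toolchainopts dict and that nothing else is set there; merge it with whatever your easyconfig already has.

```python
# Easyconfig fragment (easyconfigs use Python syntax).
# Assumption: no other toolchain options are set in this easyconfig.
# 'opt': True asks EasyBuild for -O3 instead of its default -O2.
toolchainopts = {'opt': True}
```

Dropping the 'opt' entry falls back to the -O2 default, which is what I ended up using.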

Preparing to run the same benchmark[*] with the GPU(s) (2x Tesla K80).

[*] https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/README.md

Some other notes:

Don't even try to build TensorFlow on RHEL/CentOS 6: its older
kernel doesn't have MADV_HUGEPAGE support.

For TensorFlow, I customized for our site with:

cuda_compute_capabilities = ['3.5', '3.7']  # for Tesla K20 (ada) and K80 (terra)

I could have left out 3.5, since ada runs RHEL6 and so can't build TensorFlow anyway.

jack


On 03/14/2018 05:48 AM, Jack Perdue wrote:
+1 !!!!!

I struggled with the same issue (I have no idea where Stephane got his/her copy).

FWIW, attached is what I came up with; it includes
that fix and a cleanup of the duplicate libs.

Jack Perdue
Lead Systems Administrator
High Performance Research Computing
TAMU Division of Research
[email protected]    http://hprc.tamu.edu
HPRC Helpdesk: [email protected]

On 03/14/2018 05:35 AM, Joachim Hein wrote:
Hi,

I am trying TensorFlow-1.5.0-goolfc-2017b-Python-3.6.3.eb. It looks for a file named cudnn-9.0-linux-x64-v7.0.5.15.tgz, but I am currently getting cudnn-9.0-linux-x64-v7.tgz from the NVIDIA download site. The sha256 sum of the file I just downloaded agrees with the one in the EB config. After renaming my download to the name expected by EB, cuDNN builds.

Can the config be updated to handle both the old and the new name? Is that something EB supports? Otherwise we should leave a comment inside the config saying that renaming is a workaround (one needs a manual download of the sources anyway).
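
In the meantime, the manual rename can be sketched as a small Python helper that only stages the file when the checksum matches. This is a rough illustration, not anything EB provides: the names sha256sum and stage_source are made up here, and the expected checksum has to be copied from the easyconfig by hand.

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large tarball is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_source(downloaded, expected_name, expected_sha256):
    """Copy `downloaded` to `expected_name` in the same directory.

    Hypothetical helper: raises ValueError if the checksum does not match,
    and leaves an already existing target file untouched.
    """
    downloaded = Path(downloaded)
    actual = sha256sum(downloaded)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch for {downloaded.name}: got {actual}"
        )
    target = downloaded.with_name(expected_name)
    if not target.exists():
        shutil.copy2(downloaded, target)
    return target
```

It would be called as stage_source("cudnn-9.0-linux-x64-v7.tgz", "cudnn-9.0-linux-x64-v7.0.5.15.tgz", checksum), with checksum taken from the EB config. Copying rather than moving keeps the original download around under the name NVIDIA gave it.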

Any comments?

Best wishes
   Joachim




