This is probably related to OpenBLAS, but it seems that tanh() is not multi-threaded, which leaves a considerable speed-up unused. MATLAB, by contrast, does multi-thread it and gets roughly a 3x speed-up over its single-threaded version.
For example, x = rand(100000,200); @time y = tanh(x); yields:

- 0.71 sec in Julia
- 0.76 sec in MATLAB with -singleCompThread
- 0.09 sec in MATLAB (which uses multi-threading by default)

The good news is that Julia (with OpenBLAS) is competitive with the single-threaded MATLAB version. However, setting the environment variable OPENBLAS_NUM_THREADS has no effect on the timings, nor do I see higher CPU usage in 'top'. Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I missing?
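To make the comparison concrete, here is a minimal sketch of the kind of threading I would expect to help. It assumes a recent Julia (1.5 or later) started with several threads, for example via `julia -t 4` or the JULIA_NUM_THREADS environment variable, and it uses the broadcast form `tanh.(x)` that recent versions require for arrays. The names `tanh_serial` and `tanh_threaded` are hypothetical helpers written for this illustration, not anything that ships with Julia or OpenBLAS.

```julia
using Base.Threads

# Single-threaded baseline: element-wise tanh over the whole matrix.
tanh_serial(x::Array{Float64}) = tanh.(x)

# Hypothetical helper (not part of Julia): split the element-wise loop
# across whatever Julia threads the session was started with.
function tanh_threaded(x::Array{Float64})
    y = similar(x)
    @threads for i in eachindex(x)
        @inbounds y[i] = tanh(x[i])
    end
    return y
end

x = rand(100_000, 200)

# Warm up on a small input so that compilation is not included in the timings.
tanh_serial(rand(10, 10))
tanh_threaded(rand(10, 10))

println("Julia threads: ", nthreads())
@time y1 = tanh_serial(x)
@time y2 = tanh_threaded(x)
```

If the threaded version is only faster when Julia itself is started with more than one thread, and OPENBLAS_NUM_THREADS makes no difference to either timing, that would suggest the element-wise tanh() does not go through OpenBLAS at all.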