On Tuesday 11 January 2011 06:45:28 Mark Wiebe wrote:
> On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe <mwwi...@gmail.com> wrote:
> > I'm a bit curious why the jump from 1 to 2 threads is scaling so
> > poorly.  Your timings have improvement factors of 1.85, 1.68, 1.64,
> > and 1.79.  Since the computation is trivial data parallelism, and I
> > believe it's still pretty far off the memory bandwidth limit, I would
> > expect a speedup of 1.95 or higher.
>
> It looks like it is the memory bandwidth which is limiting the
> scalability.

Indeed, this is an increasingly important problem for modern computers.
You may want to read:

http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

;-)

> The slower operations scale much better than faster
> ones.  Below are some timings of successively faster operations.
> When the operation is slow enough, it scales like I was expecting...
[clip]

Yeah, for another example of this with more threads, see:

http://code.google.com/p/numexpr/wiki/MultiThreadVM

OTOH, I was curious about the performance of the new iterator with
Intel's VML, but it seems to work decently too:

$ python bench/vml_timing.py  (original numexpr, *no* VML support)
*************** Numexpr vs NumPy speed-ups *******************
Contiguous case:         1.72 (mean), 0.92 (min), 3.07 (max)
Strided case:            2.1 (mean), 0.98 (min), 3.52 (max)
Unaligned case:          2.35 (mean), 1.35 (min), 3.31 (max)

$ python bench/vml_timing.py  (original numexpr, VML support)
*************** Numexpr vs NumPy speed-ups *******************
Contiguous case:         3.83 (mean), 1.1 (min), 10.19 (max)
Strided case:            3.21 (mean), 0.98 (min), 7.45 (max)
Unaligned case:          3.6 (mean), 1.47 (min), 7.87 (max)

$ python bench/vml_timing.py  (new iter numexpr, VML support)
*************** Numexpr vs NumPy speed-ups *******************
Contiguous case:         3.56 (mean), 1.12 (min), 7.38 (max)
Strided case:            2.37 (mean), 0.09 (min), 7.63 (max)
Unaligned case:          3.56 (mean), 2.08 (min), 5.88 (max)

However, there are a couple of quirks here:

1) The original Numexpr is generally faster than the iter version.

2) The strided case is quite a bit worse for the iter version.
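
For anyone who wants to poke at the three cases outside the benchmark script, here is a minimal sketch. It is not the code in bench/vml_timing.py: the helper names (make_inputs, best_time, compare), the array size, and the way the strided and unaligned arrays are built are all assumptions made for illustration. numexpr is treated as optional, so the sketch still runs with NumPy alone.

```python
import timeit

import numpy as np

try:
    import numexpr as ne  # optional; only needed for the numexpr timings
except ImportError:
    ne = None


def make_inputs(n):
    """Build contiguous, strided and unaligned float64 arrays with equal values.

    Hypothetical helper: the real benchmark may construct its inputs differently.
    """
    contiguous = np.linspace(0.0, 1.0, n)

    # Strided: every second element of a buffer twice as long (16-byte stride).
    strided = np.empty(2 * n, dtype=np.float64)[::2]
    strided[:] = contiguous

    # Unaligned: a float64 view that starts one byte into a byte buffer.
    raw = np.zeros(8 * n + 1, dtype=np.uint8)
    unaligned = raw[1:].view(np.float64)
    unaligned[:] = contiguous

    return {"contiguous": contiguous, "strided": strided, "unaligned": unaligned}


def best_time(func, repeat=3, number=1):
    """Return the best per-call wall time of func, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def compare(n=1_000_000):
    """Time exp(f3) with NumPy (and numexpr, if installed) for each layout."""
    for name, f3 in make_inputs(n).items():
        t_np = best_time(lambda: np.exp(f3))
        line = f"{name:>10}: numpy {t_np:.4f} s"
        if ne is not None:
            t_ne = best_time(
                lambda: ne.evaluate("exp(f3)", local_dict={"f3": f3})
            )
            line += f", numexpr {t_ne:.4f} s, speed-up {t_np / t_ne:.2f}"
        print(line)


if __name__ == "__main__":
    compare()
```

The speed-up printed here is simply the NumPy time divided by the numexpr time, the same ratio reported in the summaries above, so a value below 1 means numexpr was slower than NumPy for that layout.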

I've isolated the tests that perform worse for the iter version, and
here are a couple of samples:

*************** Expression: exp(f3)
numpy: 0.0135
numpy strided: 0.0144
numpy unaligned: 0.0200
numexpr: 0.0020
Speed-up of numexpr over numpy: 6.6584
numexpr strided: 0.1495
Speed-up of numexpr over numpy: 0.0962
numexpr unaligned: 0.0049
Speed-up of numexpr over numpy: 4.0859

*************** Expression: sin(f3)>cos(f4)
numpy: 0.0291
numpy strided: 0.0366
numpy unaligned: 0.0407
numexpr: 0.0166
Speed-up of numexpr over numpy: 1.7518
numexpr strided: 0.1551
Speed-up of numexpr over numpy: 0.2361
numexpr unaligned: 0.0175
Speed-up of numexpr over numpy: 2.3246

Maybe you can shed some light on what's going on here (shall we discuss
this off-list so as not to bore people too much?).

-- 
Francesc Alted
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion