Ralf Gommers <ralf.gomm...@gmail.com> wrote:

> For most routines performance seems to be comparable, and both are much
> better than ATLAS. When there's a significant difference, I have the
> impression that OpenBLAS is more often the slower one (example:
> https://github.com/xianyi/OpenBLAS/issues/533).

Accelerate is in general better optimized for level-1 and level-2 BLAS than OpenBLAS. There are two reasons for this:

First, OpenBLAS does not use AVX for these kernels, but Accelerate does. This is the more important difference. It seems the OpenBLAS devs are now working on this.

Second, the thread pool in OpenBLAS is not as scalable on small tasks as "Grand Central Dispatch" (GCD), which Accelerate uses. The GCD thread pool is quite unusual in having a very small overhead: it takes only 16 extra opcodes (IIRC) to run a task on the global parallel queue instead of on the current thread. (Even if my memory is not perfect and it is not exactly 16 opcodes, it is within that order of magnitude.) GCD can do this because the global queues and the thread pool are built into the kernel of the OS. OpenBLAS and MKL, on the other hand, depend on thread pools managed in userspace, about which the OS scheduler has no special knowledge.

When you need fine-grained parallelism and synchronization, there is nothing like GCD. Even a userspace spinlock will have a bigger overhead than a serial queue in GCD. With a userspace thread pool all threads are scheduled on a round-robin basis, but with GCD the scheduler has special knowledge about the tasks put on the queues and executes them as fast as possible. Accelerate therefore has a unique advantage when running level-1 and level-2 BLAS routines, one with which OpenBLAS and MKL can probably never properly compete.

Programming with GCD can often be counter-intuitive to someone used to dealing with OpenMP, MPI or pthreads. For example, it is often better to enqueue a lot of small tasks than to split the computation into large chunks of work. When parallelising a tight loop, a chunk size of 1 can be great on GCD but is likely to be horrible on OpenMP and anything else that has userspace threads.

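To make the chunk-size point concrete, here is a minimal sketch in Python of how a level-1 style loop gets split into tasks of a given chunk size on a userspace thread pool. It is not GCD and not how Accelerate or OpenBLAS are implemented; `axpy_chunk` and `parallel_axpy` are hypothetical names made up for this illustration. Because of the GIL the Python threads will not give a real speedup here, so the timings at the bottom only show how per-task overhead grows as the chunk size shrinks.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def axpy_chunk(a, x, y, start, stop):
    """Level-1 style kernel: y[i] += a * x[i] for i in [start, stop)."""
    for i in range(start, stop):
        y[i] += a * x[i]


def parallel_axpy(a, x, y, chunk_size, pool):
    """Split the loop into tasks of `chunk_size` iterations and run them on `pool`.

    Returns the number of tasks that were enqueued.
    """
    n = len(x)
    futures = [
        pool.submit(axpy_chunk, a, x, y, start, min(start + chunk_size, n))
        for start in range(0, n, chunk_size)
    ]
    for f in futures:
        f.result()  # wait for completion and re-raise any exception
    return len(futures)


if __name__ == "__main__":
    n = 20000  # assumed problem size, chosen only to keep the demo short
    x = [1.0] * n
    with ThreadPoolExecutor(max_workers=4) as pool:
        for chunk_size in (1, 100, 5000):
            y = [0.0] * n
            t0 = time.perf_counter()
            n_tasks = parallel_axpy(2.0, x, y, chunk_size, pool)
            elapsed = time.perf_counter() - t0
            print(f"chunk_size={chunk_size:5d}  tasks={n_tasks:6d}  {elapsed:.4f} s")
```

Each task writes a disjoint slice of `y`, so no locking is needed. With a chunk size of 1 every single iteration pays the cost of a task submission to the pool, which is the overhead I claim GCD keeps small.
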
Sturla

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion