[pypy-commit] extradoc extradoc: text as copied from blogpost draft

mattip Mon, 31 Oct 2016 10:04:55 -0700

Author: Matti Picus <[email protected]>
Branch: extradoc
Changeset: r5743:5cf27efd4195
Date: 2016-10-31 18:30 +0200
http://bitbucket.org/pypy/extradoc/changeset/5cf27efd4195/


Log:    text as copied from blogpost draft

diff --git a/blog/draft/vectorization_extended.rst b/blog/draft/vectorization_extended.rst
new file mode 100644
--- /dev/null
+++ b/blog/draft/vectorization_extended.rst
@@ -0,0 +1,109 @@
+We are happy to announce that both the PowerPC backend and the s390x backend
+have been enhanced. Both are now capable of emitting SIMD instructions for
+vectorized loops. Special thanks to IBM for funding this work.
+
+
+If you are not familiar with this topic, you can find more details here.
+
+
+There are many more enhancements under the hood. Most notably, all pure
+operations are now delayed to the latest possible point. Previously, some
+indices were calculated more than once or needed an additional register because
+the old value was still in use. Additionally, it is now possible to load
+quadword-aligned memory on both ppc and s390x (x86 currently cannot do that).
+
+NumPy & CPyExt
+--------------
+
+Community and core development effort is pushing CPyExt towards a complete, but
+emulated, layer for CPython C extensions. This is great, because the one
+restriction preventing the deployment of PyPy in several scenarios will soon be
+removed. Still, we recommend not relying on CPyExt: either write no C code at
+all (let PyPy speed up your Python code) or use cffi.
+
+
+The work done in this project helps Micro NumPy (NumPyPy) speed up operations
+on ppc and s390x. But NumPyPy and NumPy ... do we need both? There are several
+cases where one of them is not the best-performing solution. Our plan is to
+integrate both and use whichever one performs well where we know the other
+does not.
+
+
+Just to give you an idea of why this is a benefit:
+
+
+NumPy arrays can carry custom dtypes, and user-defined Python functions can be
+applied to them. How could one optimize this kind of scenario? In a traditional
+setup, you cannot. But as soon as Micro NumPy is turned on, you can suddenly
+JIT-compile this code and vectorize it.
+
+Another example is frequent element access, or any other call that often
+crosses into the C level.
+
+
+Benchmarks
+----------
+
+Let's have a look at some benchmarks that reuse mikefc's numpy benchmark suite.
+The suite only runs a subset of all commands, showing that the core
+functionality works properly. Additionally, it has been rewritten to use perf
+instead of the timeit stdlib module.
+
+
+Setup
+-----
+x86 runs on an Intel i7-2600 clocked at 3.40 GHz using 4 cores. PowerPC runs on
+a Power 8 clocked at 3.425 GHz providing 160 cores. Last but not least, the
+mainframe machine is clocked at up to 4 GHz, but fully virtualized (as is
+common for such machines).
+
+
+As you can see, all machines run very different configurations. It does not
+make sense to compare across platforms, but rather to compare implementations
+on the same platform.
+
+
+
+
+
+
+
+Blue shows CPython 2.7.10+ available on that platform using the latest NumPy
+(1.11). Micro NumPy is used for PyPy. PyPy+ indicates that the vectorization
+optimization is turned on.
+
+All bar charts show the median value of all runs (5 samples, 100 loops, 10 inner
+loops; for the operations on vectors (not matrices) the loops are set to 1000).
+PyPy additionally gets 3 extra executions to warm up the JIT.
+
+
+The comparison is really a comparison of machine code speed. It compares PyPy's
+JIT output with GCC's output. It has little to do with the speed of the
+interpreter.
+
+
+Both new SIMD backends speed up the numeric kernels. Sometimes PyPy is close to
+the speed of CPython (note that PyPy runs the machine code kernel only after
+interpreting it at least 1000 times), sometimes it is faster. The maximum
+parallelism very much depends on the extension emitted by the compiler. All
+three SIMD backends have the same core register size (128 bits). This means
+that all three behave similarly, but ppc and s390x gain more because they can
+load 128 bits of memory from quadword-aligned memory.
+
+
+Future directions
+-----------------
+
+Python seems to be in an ongoing transition from a language used mostly for web
+development to one also used in data science. This is currently starting to
+emerge in Europe, and Python is already heavily used for data science in the
+United States of America and many other places around the world.
+
+
+I believe that PyPy can make a valuable contribution for data scientists,
+helping them to rapidly write scientific programs in Python and run them at
+near-native speed. If you happen to be in that situation, we are eager to hear
+your feedback, resolve your issues, and work together to improve your
+simulations, calculations, .... Just get in touch!
+
+Richard Plangger (plan_rich) and the PyPy team
_______________________________________________
pypy-commit mailing list
[email protected]
https://mail.python.org/mailman/listinfo/pypy-commit
