Author: Hakan Ardo <[email protected]>
Branch: extradoc
Changeset: r4614:f1926fc5fc60
Date: 2012-08-16 14:43 +0200
http://bitbucket.org/pypy/extradoc/changeset/f1926fc5fc60/

Log:    typos and clarifications

diff --git a/talk/dls2012/licm.pdf b/talk/dls2012/licm.pdf
index 53e9a461f7d0e384c8c7fba88a6002c1337aaeb1..69f4a54d80bb6983114b698f3ac8e463a4831d1c
GIT binary patch

[cut]

diff --git a/talk/dls2012/paper.tex b/talk/dls2012/paper.tex
--- a/talk/dls2012/paper.tex
+++ b/talk/dls2012/paper.tex
@@ -125,7 +125,7 @@
 
 \begin{abstract}
 One of the nice properties of a tracing JIT is that many of its optimizations
-are simple requiring one forward pass only. This is not true for loop-invariant code
+are simple, requiring one forward pass only. This is not true for loop-invariant code
 motion, which is a very important optimization for code with tight kernels.
 Especially for dynamic languages that typically perform quite a lot of loop invariant
 type checking, boxed value unwrapping and virtual method lookups.
@@ -823,7 +823,7 @@
                  \cdots, m\left(\hat J_{|\hat J|}\right)\right)
   .
 \end{equation}
-In the optimized trace $I$ is replaced by $\hat I$ and $K$ by $\hat
+In the optimized trace $J$ is replaced by $\hat J$ and $K$ by $\hat
 K$. The trace from Figure~\ref{fig:unopt-trace} will be optimized to
 the trace in Figure~\ref{fig:virtual-trace}.
 
@@ -991,11 +991,13 @@
   fixpoint arithmetic with 16 bits precision. In Python there is only
   a single implementation of the benchmark that gets specialized
  depending on the class of its input argument, $y$, while in C,
-  there are three different implementations.
+  there are three different implementations. In Lua there is no support for
+  integers, so only two versions are provided: float and Fix16. Here Fix16 is a custom class
+  that implements scaled floating point arithmetic.
 \item {\bf conv3}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $3$. A single loop
-is used to calculate a vector ${\bf b} = \left(b_1, \cdots, b_n\right)$ from a vector
+is used to calculate a vector ${\bf b} = \left(b_1, \cdots, b_{n-2}\right)$ from a vector
 ${\bf a} = \left(a_1, \cdots, a_n\right)$ and a kernel ${\bf k} = \left(k_1, k_2, k_3\right)$ using
-$b_i = k_3 a_i + k_2 a_{i+1} + k_1 a_{i+2}$ for $1 \leq i \leq n$. Both the output vector, $\bf b$,
+$b_i = k_3 a_i + k_2 a_{i+1} + k_1 a_{i+2}$ for $1 \leq i \leq n-2$. Both the output vector, $\bf b$,
 and the input vectors, $\bf a$ and $\bf k$, are allocated prior to running the benchmark. It is executed
 with $n=10^5$ and $n=10^6$.
 \item {\bf conv5}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with
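
For illustration, a minimal Python sketch of the conv3 kernel with the corrected bounds above; the function name and signature are hypothetical and not taken from the benchmark sources:

    def conv3(a, k):
        # One-dimensional convolution with fixed kernel size 3.
        # For an input of length n only n-2 outputs can be computed,
        # hence the corrected bounds 1 <= i <= n-2 (one-based),
        # i.e. 0 <= i <= n-3 in zero-based indexing.
        k1, k2, k3 = k
        n = len(a)
        b = [0.0] * (n - 2)
        for i in range(n - 2):
            b[i] = k3 * a[i] + k2 * a[i + 1] + k1 * a[i + 2]
        return b
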
@@ -1014,7 +1016,7 @@
     k_{1,3} a_{i+1,j-1} &+& k_{1,2} a_{i+1,j} &+& k_{1,1} a_{i+1,j+1}  \\
   \end{array}
 \end{equation}
-for $1 \leq i \leq m$ and $1 \leq j \leq n$.
+for $2 \leq i \leq m-1$ and $2 \leq j \leq n-1$.
 The memory for storing the matrices is again allocated outside the benchmark, and both $(n,m)=(1000,1000)$
 and $(n,m)=(1000000,3)$ were used.
 \item {\bf dilate3x3}$\left(n\right)$: two-dimensional dilation with kernel of fixed
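
Similarly, a hedged Python sketch of the conv3x3 interior bounds (hypothetical names, following the paper's equation where kernel indices run opposite to the offsets, e.g. $k_{1,3}$ pairs with $a_{i+1,j-1}$):

    def conv3x3(a, k, m, n):
        # Two-dimensional 3x3 convolution over an m-by-n matrix: only
        # rows 2..m-1 and columns 2..n-1 (one-based) have all nine
        # neighbours available, hence the corrected bounds above.
        b = [[0.0] * n for _ in range(m)]
        for i in range(1, m - 1):
            for j in range(1, n - 1):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        acc += k[1 - di][1 - dj] * a[i + di][j + dj]
                b[i][j] = acc
        return b
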
@@ -1051,7 +1053,7 @@
 For the C implementations it is
 implemented as a C++ class. The other benchmarks are implemented in
 plain C. All the benchmarks except sqrt operate on C double-precision floating
-point numbers, both in the Python and the C code.
+point numbers, both in the Python, C and Lua code.
 
 In addition we also ported the 
 SciMark\footnote{\texttt{http://math.nist.gov/scimark2/}} benchmarks to Python, and compared
@@ -1093,7 +1095,7 @@
 We also run PyPy with loop peeling optimization and without (but otherwise
 identical).
 
-For PyPy 10 iterations were run, prefaced with 3 iterations for warming up.
+For PyPy and Lua, 10 iterations were run, prefaced with 3 iterations for warming up.
 Due to benchmarks taking large amounts of time on CPython, only one run
 was performed, prefaced with one warmup run for Psyco.
 For GCC 5 iterations
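
A minimal Python sketch of this measurement methodology; `benchmark` is a hypothetical zero-argument callable, not part of the paper's actual harness:

    import time

    def measure(benchmark, warmup=3, iterations=10):
        # Untimed warmup runs let the JIT trace and compile the hot
        # loops before measurement starts.
        for _ in range(warmup):
            benchmark()
        times = []
        for _ in range(iterations):
            start = time.time()
            benchmark()
            times.append(time.time() - start)
        return min(times), sum(times) / len(times)
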
@@ -1107,7 +1109,11 @@
 speedup of loop peeling is 70\%, which makes benchmark times
 comparable with native-compiled C code. We attribute the performance gap to C code to
 the relative immaturity of RPython's JIT machine code backend as well as missing
-optimizations, like instruction scheduling.
+optimizations, like instruction scheduling. Also, in the case of nested loops,
+operations are only moved out of the innermost loop. That is an issue when the
+innermost loop is short and a significant amount of time is spent in the outer
+loops, as is the case with, for example, SparseMatMult.
 
 Other interesting interpreters that are helped greatly by this optimization are
 for example our Prolog interpreter written in
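
To illustrate the nested-loop limitation, a hedged Python sketch (not taken from the paper): when the inner trip count is small, say 3, hoisting invariants out of the innermost loop alone saves little, because the outer loop body runs almost as often as the inner one.

    def nested_sum(a, rows, cols):
        total = 0.0
        for i in range(rows):      # overhead per outer iteration is NOT hoisted
            row = a[i]
            for j in range(cols):  # loop peeling only optimizes this loop
                total += row[j]
        return total
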
@@ -1164,7 +1170,7 @@
 
 The type specialization described by Gal \etal~\cite{gal_trace-based_2009} can
 be seen as doing a similar optimization (again by manually implementing it)
-than the one described in Section~\ref{sub:allocation}: The effect of both is
+as the one described in Section~\ref{sub:allocation}: The effect of both is
 that type checks are fully done before a loop is even entered.
 
 