Author: Remi Meier <remi.me...@gmail.com> Branch: extradoc Changeset: r5377:fa7bf3bf9ace Date: 2014-07-30 14:37 +0200 http://bitbucket.org/pypy/extradoc/changeset/fa7bf3bf9ace/
Log: removing a lot of footnotes and other cleanups diff --git a/talk/dls2014/paper/paper.tex b/talk/dls2014/paper/paper.tex --- a/talk/dls2014/paper/paper.tex +++ b/talk/dls2014/paper/paper.tex @@ -288,11 +288,16 @@ \emph{Threading} employs operating system (OS) threads to provide concurrency. It is, however, limited by the GIL and thus does not -provide parallelism\footnote{At this point we should mention that it is indeed -possible to run external functions written in C instead of Python in -parallel. Our work focuses on Python itself and ignores this aspect as -it requires writing in a different language.}. Moreover, the GIL can -only ensure correctness of the interpreter itself: applications are +provide real parallelism. +The only code that is not necessarily restricted by the GIL in CPython +is code written in e.g.\ C instead of Python. External functions can +choose to run without the GIL and thus in parallel. We ignore this +aspect here as it requires writing in a different language than Python +and is therefore not relevant to our work. + +The GIL can also +only ensure correctness of the interpreter itself: applications using +threads are still required to coordinate concurrent accesses to data structures using conventional methods -- locks being the most common way. @@ -444,28 +449,22 @@ need to adapt these addresses to the segment they live in, which would prevent completely any page sharing because then the binary contents differ between segments. Instead, we store pointers as -\emph{Segment Offsets ($SO$)}, i.e.\ offsets from the start of the -segment. An $SO$ can then be interpreted in any segment by adding to -it the start address of that segment.\footnote{An alternative model -would be to simulate threads using several independent processes in -which all segments are at the same address.
This might work well for -demonstration purposes, but it is not really suitable for implementing -an existing programming language: programs assume that most of the -state (file descriptors, external libraries, ...) is shared.} The +\emph{Segment Offsets (SO)}, i.e.\ offsets from the start of the +segment. An SO can then be interpreted in any segment by adding to +it the start address of that segment. The result is called a \emph{Linear Address (LA)}. This is illustrated in Figure~\ref{fig:Segment-Addressing}. The x86 CPUs provide a feature which is called \emph{memory -segmentation}. On the modern x86-64, this is enough to perform the -addition described above very efficiently. We use the segment -register $\%gs$\footnote{The other segment register $\%fs$ is -typically used by the threading library to provide access to -thread-local data. One point of view is that we are using $\%gs$ for -a similar purpose: thread-local data -- but a lot of it.}, which is -available to applications, to point to the current thread's segment -start address. The translation from an $SO$ to an LA is done for us by + segmentation}. On these CPUs, this is enough to perform the addition +described above very efficiently. On the modern x86-64, there are two +segment registers left: $\%fs$ and $\%gs$. On Linux, for example, +$\%fs$ is already used for addressing thread-local data +efficiently. Thus, we use the remaining segment register $\%gs$, which is +still available to applications, to point to the current thread's segment +start address. The translation from an SO to an LA is done for us by using the ``$\%gs\colon$'' prefix before any CPU instruction that -reads or writes memory using an $SO$ address. This is very efficient +reads or writes memory using an SO address. This is very efficient -- we can do it on every object access -- and some compilers support it natively (e.g.\ clang).
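The SO-to-LA translation that this hunk describes can be sketched in plain C. In this minimal sketch the segment base is an explicit pointer argument; in the real system the CPU adds it implicitly via the ``$\%gs\colon$'' instruction prefix. The names (`SO`, `so_to_la`) are illustrative, not taken from the paper's code:

```c
#include <stdint.h>

typedef uintptr_t SO;   /* Segment Offset: offset from the segment start */

/* In the real system the addition is done implicitly by the CPU via
 * the "%gs:" instruction prefix; here it is spelled out with an
 * explicit segment base pointer (illustrative only). */
static inline char *so_to_la(char *segment_base, SO so)
{
    return segment_base + so;   /* SO + segment start = Linear Address */
}
```

Because the same SO is valid in every segment, passing a different segment base to the same offset yields that segment's private or shared copy of the object.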
@@ -478,7 +477,7 @@ In summary, translating a $\%gs:SO$ to a physical address is a two-step process: First the memory segmentation feature of the CPU constructs a linear address. Then, this LA gets mapped by the MMU to -some place in the physical memory. This makes the $SO$ a valid reference +some place in the physical memory. This makes the SO a valid reference to an object in all segments, automatically resolving either to a shared or a private version. @@ -503,8 +502,8 @@ itself, but it is missing conflict detection to be considered a full-fledged STM system. This requires adding \emph{barriers} to the program to register the objects in the read or write set before -reading or writing to them\footnote{Barriers added by an automatic, conservative, and -local program transformation~\cite{felber07}}. Furthermore, we still +reading or writing to them. We do this with an automatic, conservative, and +local program transformation (cf.~\cite{felber07}). Furthermore, we still need to describe the commit protocol. \begin{description} @@ -516,7 +515,7 @@ approach allows the read barrier to be a single, unconditional write into segment-local memory. We only need to \emph{mark} an object as read by the current transaction. The mark is stored in a - segment-local array indexed by a number derived from the $SO$ + segment-local array indexed by a number derived from the SO of the object. Unlike other STM systems, the read barrier does not have to find the @@ -532,10 +531,11 @@ Here, we eagerly detect write-write conflicts by allowing only one transaction modifying an object at a time using write - locks.\footnote{Eager write-write detection is not inherent to our - approach; we may lift this restriction in the future.} Part of making - the object write-ready is also the privatisation of all pages the - object belongs to, as described above. 
+ locks\footnote{Eager write-write conflict detection is not inherent + to our approach; we may lift this restriction in the future if it + proves to be useful.}. Making an object write-ready also involves + privatising all pages the object belongs to, as + described above. Described below, we also need a write barrier for our generational garbage collector (GC). Interestingly, we are able to share the same @@ -633,7 +633,7 @@ \lstinline!stm_commit_transaction()! require saving object references. -In the following sections, whenever we use $SO$, we go through the +In the following sections, whenever we use SO, we go through the address translation to get to the actual contents of an object. This is also signified by the type \lstinline!object_t!. This type is special as it causes the @@ -667,12 +667,12 @@ Another issue is the support of irreversible operations, like calls to external functions or general -input~/ output. If we are allowed to, we commit the transaction -before and start a new one after these operations\footnote{The language -specification needs to specify if this is allowed.}. Within an atomic +input~/ output. If Python allows it, we commit the transaction before +and start a new one after these operations (the same way as the GIL is +released around thread-safe external functions). Within an atomic block, however, the transaction is unbreakable, and we need to turn -the transaction inevitable so that it -cannot abort and strong isolation is guaranteed. +the transaction inevitable so that it cannot abort and strong +isolation is guaranteed at all times. @@ -699,7 +699,7 @@ segmentation violation when accessed. We use this to detect erroneous dereferencing of \lstinline!NULL! references. All $\%gs{::}SO$ translated to linear addresses will point to NULL pages - if $SO$ is set to \lstinline!NULL!.
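The eager write-write conflict detection via write locks mentioned in the hunk above can be sketched as follows. The array layout (one global byte per 16 bytes of address space, holding the owning segment's number) matches the paper's later description of \lstinline!write_locks!; the function name and the single-threaded locking (no atomic compare-and-swap) are simplifications of this sketch:

```c
#include <stdint.h>
#include <stdbool.h>

/* write_locks is global (shared by all segments): one byte per
 * 16 bytes of address space; 0 = unlocked, otherwise the number of
 * the owning segment.  Size and names are illustrative. */
static uint8_t write_locks[1u << 20];

/* Eager write-write conflict detection: only one transaction may
 * modify an object at a time.  Real code would need an atomic
 * compare-and-swap here; this sketch is single-threaded. */
static bool try_acquire_write_lock(uintptr_t so, uint8_t segment_num)
{
    uint8_t *lock = &write_locks[so / 16];
    if (*lock == segment_num)
        return true;                /* we already own the lock */
    if (*lock != 0)
        return false;               /* conflict: another segment owns it */
    *lock = segment_num;
    return true;
}
```

On a conflict (return value `false`), contention management decides whether to wait or to abort one of the two transactions.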
\item [{Segment-local~data:}] Some area private to the segment that contains segment-local information for bookkeeping. \item [{Read~markers:}] These are private pages that store information about @@ -707,8 +707,11 @@ segment. \item [{Nursery:}] This private area contains all the freshly allocated objects (\emph{young objects}) of the current transaction. The GC - uses bump-pointer allocation in this area to allocate objects in the - first generation. + uses bump-pointer allocation in this area to allocate objects of the + first generation. To make sure that only objects of the current + transaction reside in this space, a collection of the first + generation happens at the end of every transaction (see + section~\ref{sub:gc}). \item [{Old~object~space:}] These pages are the ones that are really shared between segments. They mostly contain old objects but also some young ones that were too big to be allocated in the nursery. @@ -717,13 +720,13 @@ Note, since the above configuration is currently specified at compile time, all these areas are at offsets inside the segments known to the compiler. This makes some checks very efficient, e.g.\ checking -if an object resides in the nursery only requires comparing its $SO$ +if an object resides in the nursery only requires comparing its SO to the static boundaries of the nursery. It is possible that we want some parameters to be configurable at startup or even during the runtime. In that case we can still use a compile-time specified maximum so that these checks are still efficient. E.g.\ limiting the maximum amount of memory available to the application statically to a few -terabytes is fine because it corresponds to virtual memory, +terabytes is fine because it corresponds to virtual memory; the real physical memory is assigned on-demand by the operating system. @@ -759,7 +762,7 @@ -\subsection{Garbage Collection} +\subsection{Garbage Collection\label{sub:gc}} Garbage collection plays a big role in our TM system. 
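The nursery's bump-pointer allocation of first-generation objects, described in the hunk above, can be sketched like this. The nursery size, the function name, and the NULL-on-full behaviour are assumptions of this sketch (the real system triggers a minor collection instead of returning NULL):

```c
#include <stddef.h>
#include <stdint.h>

#define NURSERY_SIZE (4u * 1024 * 1024)     /* illustrative size */

static _Alignas(16) char nursery[NURSERY_SIZE];
static char *nursery_current = nursery;

/* Bump-pointer allocation of a young object in the nursery. */
static void *stm_allocate_young(size_t size)
{
    size = (size + 15) & ~(size_t)15;   /* every object is >= 16 bytes,
                                           so each has its own read marker */
    if (nursery_current + size > nursery + NURSERY_SIZE)
        return NULL;                    /* full: would do a minor collection */
    void *result = nursery_current;
    nursery_current += size;
    return result;
}
```

Allocation is just one addition and one comparison, which is why checking nursery membership against static boundaries (as the paper notes) stays cheap.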
The GC is generational and has two generations: the \emph{young} and the @@ -828,7 +831,7 @@ and already mentioned in Section~\ref{sub:Setup}. This area can be seen as a continuous, segment-local array of bytes -that is indexed with an object's reference ($SO$) divided by 16. So +that is indexed with an object's reference (SO) divided by 16. So that each object has its own read marker, all objects have a size of at least 16 bytes. Otherwise there could be false conflicts when reading from two adjacent objects that share a single marker. @@ -839,9 +842,9 @@ \lstinline!false! on commit and only need to do this every 255 transactions. The whole code for the barrier shown in Listing~\ref{lst:rb} is easily optimisable for compilers as well as -perfectly predictable for CPUs\footnote{Additional benefit: The read -barrier is not constrained to execute before the actual read -- both -the compiler and the CPU are free to reorder or group them.}. +perfectly predictable for CPUs. Additionally, the read barrier is not +constrained to execute before the actual read -- both the compiler and +the CPU are free to reorder or group them for efficiency. \begin{code}[h] \begin{lstlisting} @@ -876,7 +879,7 @@ The \textbf{fast path} of the write barrier is very simple. We only need to check for the flag \lstinline!WRITE_BARRIER! in the object's -header through its $SO$ and call the slow path if it is set. This flag +header through its SO and call the slow path if it is set. This flag is set either if the object is old and comes from an earlier transaction, or if there was a minor collection (in this case, the flag is added again on all objects, including new overflow objects). The flag @@ -946,7 +949,7 @@ For the \emph{STM part}, we first perform a read barrier on the object. We then try to acquire its write lock. \lstinline!write_locks! is a simple, \emph{global} array of bytes that is indexed with the -$SO$ of the object divided by 16. 
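The single-write read barrier discussed in this hunk (one byte per object at index SO / 16, set to the current read version) can be sketched as follows. The array size and the exact versioning scheme are assumptions of this sketch, not taken from the paper's listing:

```c
#include <stdint.h>

#define NB_READMARKERS (1u << 20)   /* illustrative size */

/* One marker byte per 16 bytes of address space; in the real system
 * this array lives in private, segment-local pages. */
static uint8_t read_markers[NB_READMARKERS];
static uint8_t transaction_read_version = 1;  /* reset every 255 commits */

/* The whole read barrier: a single, unconditional, perfectly
 * predictable write into segment-local memory. */
static inline void stm_read(uintptr_t so)
{
    read_markers[so / 16] = transaction_read_version;
}
```

Since the barrier has no branches and no ordering constraint with the read itself, the compiler and CPU are free to reorder or batch these marker writes.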
Note, ``global'' here really means +SO of the object divided by 16. Note, ``global'' here really means it is a single array with data for all segments, there is no address translation going on to access its elements contrary to e.g.\ the read markers array. If we already own the lock, we are done. @@ -961,7 +964,8 @@ In all cases, we remove the \lstinline!WRITE_BARRIER! flag from the object before we return. Thus, we never trigger the slow path again before we do the next minor collection or we start the next -transaction.\footnote{We always do a minor collection during a commit.} +transaction (remember: we always perform a minor collection as part of +each commit anyway). Note that we have three kinds of objects: \emph{young, overflow} and \emph{old} objects. Young and overflow objects were created in the @@ -995,16 +999,20 @@ Committing a transaction needs a bit more work. First, we synchronise all threads so that the committing one is the only one running and all -the others are waiting in safe points.\footnote{This design choice is -important to support our very cheap barriers. It is a trade-off that -works best if the number of concurrent threads is small. -See section~\ref{subsub:sync} for how we mitigate the costs.} +the others are waiting in safe points. We then go through the write set (\lstinline!modified_old_objects!) and check the corresponding \lstinline!read_markers! in other threads~/ segments. If we detect a read-write conflict, we perform contention management to either abort us or the other transaction, or to simply wait a bit (see Section~\ref{subsub:contentionmanagement}). +Note that synchronising all threads is an important design choice we +made to support our very cheap read barriers. Since they now consist +only of a single unconditional write to some memory location, all +threads need to be stopped while we detect conflicts. This is a +trade-off that works best if the number of concurrent threads is +small. 
+ After verifying that there are no conflicts anymore, we copy all our changes (i.e.\ objects in our write set) to all other segments, including the sharing-segment. This is safe since we synchronised all @@ -1292,8 +1300,8 @@ code worthwhile, we increased the input size of all benchmarks to get reasonable execution times (denoted with ``(large)'' in benchmark names). -We also ran these benchmarks on Jython\footnote{version -2.7b1~\cite{webjython}}, an implementation of Python on top of the +We also ran these benchmarks on Jython (v2.7b1 from~\cite{webjython}), +an implementation of Python on top of the Java Virtual Machine (JVM). Instead of a GIL, this interpreter uses fine-grained locking for synchronisation. Even though it can use the JVM's JIT compiler, its performance in these benchmarks is behind _______________________________________________ pypy-commit mailing list pypy-commit@python.org https://mail.python.org/mailman/listinfo/pypy-commit