This is a two-piece patch set.

patch8:

The first patch merely keeps a list of all threads with transactions,
registering and deregistering gtm_transaction objects during their
construction and destruction. The alignment attribute for the start of
the shared part of a gtm_transaction object relies on those objects
being allocated on separate cachelines (and thus aligned at cacheline
granularity). This has not been implemented yet, but is requested by the
xmalloc call for gtm_transaction.
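
For illustration, such an allocation could be provided by an aligned
variant of xmalloc; the following is just a sketch (the function name is
hypothetical and not part of the patch):

  // Hypothetical sketch of cacheline-aligned allocation, so that the
  // aligned(HW_CACHELINE_SIZE) attribute on the shared part of
  // gtm_transaction actually holds.
  #include <stdlib.h>

  static void *
  xmalloc_aligned (size_t size, size_t align)
  {
    void *ptr;
    if (posix_memalign (&ptr, align, size) != 0)
      GTM_fatal ("Out of memory.");  // GTM_fatal as used elsewhere in libitm
    return ptr;
  }

  // gtm_transaction could then be allocated via, e.g.,
  //   xmalloc_aligned (sizeof (gtm_transaction), HW_CACHELINE_SIZE);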


patch9:

This is the major part, changing the serial lock implementation to have
split, per-transaction/per-thread "reader-active" flags. See the
updates to libitm.texi for more information. Currently, only the
pthreads-based implementation is ready, but the futex-based one should
follow soon.
For the assumed common case of infrequent serial transactions, this
split of the flags removes contention between readers, but adds some
overhead when entering serial mode, and two barriers in the reader
fast path.
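
In essence, the fast paths pair a store to one flag with a load of the
other side's flag, Dekker-style. A condensed sketch (names follow the
patch; 0 means "reader active", ~0 "inactive"; the mutex-based slow
paths are omitted):

  // Reader (read_lock):
  tx->shared_state = 0;             // announce read intent
  __sync_synchronize ();            // full barrier: store before load
  if (summary & (a_writer | w_writer))
    { /* enter mutex-based slow path */ }

  // Writer (write_lock, after acquiring the mutex):
  summary |= a_writer;              // announce write intent
  __sync_synchronize ();            // full barrier: store before load
  for (gtm_transaction *it = list_of_tx; it != 0; it = it->next_tx)
    if (it->shared_state != ~(typeof(it->shared_state))0)
      { /* wait for this reader via the slow path */ }

  // Neither side may reorder its store after its load, so at least one
  // of the two always observes the other's flag.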

For a simple performance test with the serial-on-write TM method and the
red-black integer set microbenchmark, I got the following numbers for
1/2/4 threads, with x% being the percentage of set update operations in
the benchmark (i.e., the percentage of transactions that have to upgrade
to serial mode). "old" is the previous implementation, "new" the new
one, "nolock" is without reader-side locking (unsafe for >0%), and
"nofence" is "new" but without the necessary fences in the reader fast
path (unsafe for >0%, too). Numbers are throughput (million txns per 10s):

old 0%:   52 / 48 /  36

new 0%:   49 / 90 / 120
new 1%:   47 / 59 /  27
new 20%:  34 /  6 /   3
new 100%: 15 /  1 /   1

nolock 0%:   56 / 104 / 120
nolock 1%:   55
nolock 20%:  38
nolock 100%: 16
nofence 0%:  55 / 100 / 120

As you can see, the old r/w lock wasn't performing well (I don't have
precise numbers anymore, but there were even larger slowdowns for >0%
updates).
The new implementation scales well when there is no serial mode, but has
higher single-thread overhead due to the fences. I'm not happy about the
slowdowns that occur when there is some serial mode (ideally, "new 100%"
should be just flat); even at 1%, throughput already suffers.
However, there is no spinning yet in the implementation. Adding and
tuning that should reduce the overheads. (At least for the similar
implementation in tinySTM++, which had only spinning, there was some
slowdown at 100%, but not as much as above, even with ASF.)
Furthermore, we could try proper queue locks (or other options) to
reduce writer-writer contention overheads, go for a simpler r/w lock for
low thread counts, or try to optimize how writers block for readers. But
in any case, I think this should be tuned later, and revisited once we
have real-world workloads (e.g., to find practical values for how long
to spin).
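
As a starting point, such spinning could be a bounded front end to the
existing slow path; the following is a hypothetical sketch only, and
SPIN_LIMIT is a made-up tuning knob to be calibrated on real workloads:

  // Hypothetical sketch: spin briefly on the writer flags before taking
  // the mutex-based slow path of read_lock().
  static const unsigned SPIN_LIMIT = 1000;

  void
  gtm_rwlock::read_lock_spinning (gtm_transaction *tx)
  {
    tx->shared_state = 0;            // announce read intent
    __sync_synchronize ();
    for (unsigned i = 0; i < SPIN_LIMIT; i++)
      {
        if (!(this->summary & (a_writer | w_writer)))
          return;                    // writer went away while we spun
        __asm__ volatile ("" ::: "memory");  // force re-reading summary
      }
    read_lock (tx);                  // fall back to the blocking version
  }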

Bottom line: I think it's an improvement, but there remains a lot of
tuning to be done.

OK for branch?

commit d51d095e341d16749f6d1c8153817db7fb31945a
Author: Torvald Riegel <trie...@redhat.com>
Date:   Fri Jul 29 22:12:14 2011 +0200

    Maintain a list of all threads' transactions.
    
        * libitm_i.h (next_tx): Add list of all threads' transactions.
        * beginend.cc (GTM::gtm_transaction::begin_transaction): Register
        transaction with list of transactions and ...
        (thread_exit_handler): ... deregister here.

diff --git a/libitm/beginend.cc b/libitm/beginend.cc
index 08592a3..fc7cb8d 100644
--- a/libitm/beginend.cc
+++ b/libitm/beginend.cc
@@ -29,6 +29,7 @@ using namespace GTM;
 
 __thread gtm_thread GTM::_gtm_thr;
 gtm_rwlock GTM::gtm_transaction::serial_lock;
+gtm_transaction *GTM::gtm_transaction::list_of_tx = 0;
 
 gtm_stmlock GTM::gtm_stmlock_array[LOCK_ARRAY_SIZE];
 gtm_version GTM::gtm_clock;
@@ -75,6 +76,20 @@ thread_exit_handler(void *dummy __attribute__((unused)))
     {
       if (tx->nesting > 0)
         GTM_fatal("Thread exit while a transaction is still active.");
+
+      // Deregister this transaction.
+      gtm_transaction::serial_lock.write_lock ();
+      gtm_transaction **prev = &gtm_transaction::list_of_tx;
+      for (; *prev; prev = &(*prev)->next_tx)
+        {
+          if (*prev == tx)
+            {
+              *prev = (*prev)->next_tx;
+              break;
+            }
+        }
+      gtm_transaction::serial_lock.write_unlock ();
+
       delete tx;
       set_gtm_tx(NULL);
     }
@@ -124,6 +139,12 @@ GTM::gtm_transaction::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
       tx = new gtm_transaction;
       set_gtm_tx(tx);
 
+      // Register this transaction with the list of all threads' transactions.
+      serial_lock.write_lock ();
+      tx->next_tx = list_of_tx;
+      list_of_tx = tx;
+      serial_lock.write_unlock ();
+
       if (pthread_once(&tx_release_once, thread_exit_init))
         GTM_fatal("Initializing tx release TLS key failed.");
       // Any non-null value is sufficient to trigger releasing of this
diff --git a/libitm/libitm_i.h b/libitm/libitm_i.h
index 7d475be..e2445c7 100644
--- a/libitm/libitm_i.h
+++ b/libitm/libitm_i.h
@@ -64,6 +64,11 @@ typedef unsigned int gtm_word __attribute__((mode (word)));
 #include "dispatch.h"
 #include "containers.h"
 
+// The size of one line in hardware caches (in bytes).
+#ifndef HW_CACHELINE_SIZE
+#define HW_CACHELINE_SIZE 64
+#endif
+
 namespace GTM HIDDEN {
 
 // A dispatch table parameterizes the implementation of the STM.
@@ -187,11 +192,21 @@ struct gtm_transaction
   uint32_t restart_reason[NUM_RESTARTS];
   uint32_t restart_total;
 
+  // *** The shared part of gtm_transaction starts here. ***
+  // Shared state is on separate cachelines to avoid false sharing with
+  // thread-local parts of gtm_transaction.
+
+  // Points to the next transaction in the list of all threads' transactions.
+  gtm_transaction *next_tx __attribute__((__aligned__(HW_CACHELINE_SIZE)));
+
   // The lock that provides access to serial mode.  Non-serialized
   // transactions acquire read locks; a serialized transaction acquires
   // a write lock.
   static gtm_rwlock serial_lock;
 
+  // The head of the list of all threads' transactions.
+  static gtm_transaction *list_of_tx;
+
   // In alloc.cc
   void commit_allocations (bool, aa_tree<uintptr_t, gtm_alloc_action>*);
   void record_allocation (void *, void (*)(void *));
commit 4979a40778f450854bc44407024623c56722e392
Author: Torvald Riegel <trie...@redhat.com>
Date:   Fri Aug 5 18:58:29 2011 +0200

    Use per-transaction reader flags for the serial lock (gtm_rwlock).
    
        * config/posix/rwlock.cc (gtm_rwlock::read_lock): Changed locking
        implementation.
        (gtm_rwlock::read_unlock): Same.
        (gtm_rwlock::write_lock_generic): New. Generalized from ...
        (gtm_rwlock::write_lock, gtm_rwlock::write_upgrade): ... these.
        * libitm_i.h (GTM::gtm_transaction): Added shared_state.
        * config/posix/rwlock.h (GTM::gtm_rwlock): Removed a_reader and
        w_upgrade. Replaced by per-transaction flags (in shared_state).
        Added c_confirmed_writers.
        (GTM::gtm_rwlock::read_lock, GTM::gtm_rwlock::read_unlock,
        GTM::gtm_rwlock::write_upgrade): Add tx parameter.
        * retry.cc (GTM::gtm_transaction::decide_retry_strategy): Same.
        * method-serial.cc (GTM::gtm_transaction::serialirr_mode): Same.
        * beginend.cc (GTM::gtm_transaction::begin_transaction,
        _ITM_abortTransaction, GTM::gtm_transaction::trycommit): Same.
        * libitm.texi: Document locking conventions and implementations in
        libitm.

diff --git a/libitm/beginend.cc b/libitm/beginend.cc
index fc7cb8d..5c2aed5 100644
--- a/libitm/beginend.cc
+++ b/libitm/beginend.cc
@@ -232,7 +232,7 @@ GTM::gtm_transaction::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
         }
       else
         {
-          serial_lock.read_lock ();
+          serial_lock.read_lock (tx);
         }
 
       set_abi_disp (disp);
@@ -391,7 +391,7 @@ _ITM_abortTransaction (_ITM_abortReason reason)
       if (tx->state & gtm_transaction::STATE_SERIAL)
         gtm_transaction::serial_lock.write_unlock ();
       else
-        gtm_transaction::serial_lock.read_unlock ();
+        gtm_transaction::serial_lock.read_unlock (tx);
       tx->state = 0;
 
       GTM_longjmp (&tx->jb, a_abortTransaction | a_restoreLiveVariables,
@@ -437,7 +437,7 @@ GTM::gtm_transaction::trycommit ()
       if (state & gtm_transaction::STATE_SERIAL)
        gtm_transaction::serial_lock.write_unlock ();
       else
-       gtm_transaction::serial_lock.read_unlock ();
+       gtm_transaction::serial_lock.read_unlock (this);
       state = 0;
 
       return true;
diff --git a/libitm/config/posix/rwlock.cc b/libitm/config/posix/rwlock.cc
index 91a31a2..e033567 100644
--- a/libitm/config/posix/rwlock.cc
+++ b/libitm/config/posix/rwlock.cc
@@ -33,7 +33,7 @@ gtm_rwlock::gtm_rwlock()
   : mutex (PTHREAD_MUTEX_INITIALIZER),
     c_readers (PTHREAD_COND_INITIALIZER),
     c_writers (PTHREAD_COND_INITIALIZER),
-    c_upgrade (PTHREAD_COND_INITIALIZER),
+    c_confirmed_writers (PTHREAD_COND_INITIALIZER),
     summary (0),
     a_readers (0),
     w_readers (0),
@@ -45,20 +45,52 @@ gtm_rwlock::~gtm_rwlock()
   pthread_mutex_destroy (&this->mutex);
   pthread_cond_destroy (&this->c_readers);
   pthread_cond_destroy (&this->c_writers);
-  pthread_cond_destroy (&this->c_upgrade);
 }
 
 // Acquire a RW lock for reading.
 
 void
-gtm_rwlock::read_lock ()
+gtm_rwlock::read_lock (gtm_transaction *tx)
 {
+  // Fast path: first announce our intent to read, then check for conflicting
+  // intents to write. The barrier makes sure that this happens in exactly
+  // this order.
+  tx->shared_state = 0;
+  __sync_synchronize();
+  unsigned int sum = this->summary;
+  if (likely(!(sum & (a_writer | w_writer))))
+    return;
+
+  // There seems to be an active, waiting, or confirmed writer, so enter the
+  // mutex-based slow path. To keep the number of readers that the writer
+  // will see small, we clear our read flag right away before entering the
+  // critical section. Otherwise, the writer would have to wait for us to
+  // get into the critical section. (Note that for correctness, this only has
+  // to happen before we leave the slow path and before we wait for any
+  // writer.)
+  // ??? Add a barrier to enforce early visibility of this?
+  tx->shared_state = ~(typeof(tx->shared_state))0;
+
   pthread_mutex_lock (&this->mutex);
 
-  unsigned int sum = this->summary;
+  // Read summary again after acquiring the mutex because it might have
+  // changed while we were waiting for the mutex to become free.
+  sum = this->summary;
+
+  // If there is a writer waiting for readers, wake it up. Only do that if we
+  // might be the last reader that could do the wake-up, otherwise skip the
+  // wake-up but decrease a_readers to show that we have entered the slow path.
+  // This has to happen before we wait for any writers or upgraders.
+  // See write_lock_generic() for further explanations.
+  if (this->a_readers > 0)
+    {
+      this->a_readers--;
+      if (this->a_readers == 0)
+        pthread_cond_signal(&this->c_confirmed_writers);
+    }
 
-  // If there is a waiting upgrade, or an active writer, we must wait.
-  while (sum & (w_upgrade | a_writer | w_writer))
+  // If there is an active or waiting writer, we must wait.
+  while (sum & (a_writer | w_writer))
     {
       this->summary = sum | w_reader;
       this->w_readers++;
@@ -69,25 +101,42 @@ gtm_rwlock::read_lock ()
     }
 
   // Otherwise we can acquire the lock for read.
-  this->summary = sum | a_reader;
-  this->a_readers++;
+  tx->shared_state = 0;
 
   pthread_mutex_unlock(&this->mutex);
 }
 
 
-// Acquire a RW lock for writing.
+// Acquire a RW lock for writing. Generic version that also works for
+// upgrades.
+// Note that an upgrade might fail (and thus waste previous work done during
+// this transaction) if there is another thread that tried to go into serial
+// mode earlier (i.e., upgrades do not have higher priority than pure writers).
+// However, this seems rare enough to not consider it further, as it requires
+// a non-upgrade writer and an upgrading reader to try to switch to serial
+// mode concurrently. If we wanted to handle this, a writer waiting for
+// readers would have to coordinate with later-arriving upgrades and hand over
+// the lock to them, including the reader-waiting state. We can try to support
+// this if it actually happens often enough in real workloads.
 
-void
-gtm_rwlock::write_lock ()
+bool
+gtm_rwlock::write_lock_generic (gtm_transaction *tx)
 {
   pthread_mutex_lock (&this->mutex);
 
   unsigned int sum = this->summary;
 
-  // If there is a waiting upgrade, or an active reader or writer, wait.
-  while (sum & (w_upgrade | a_writer | a_reader))
+  // If there is an active writer, wait.
+  while (sum & a_writer)
     {
+      if (tx != 0)
+        {
+          // If this is an upgrade, we must not wait for other writers or
+          // upgrades that have already gone in before us; fail instead.
+          pthread_mutex_unlock (&this->mutex);
+          return false;
+        }
+
       this->summary = sum | w_writer;
       this->w_writers++;
       pthread_cond_wait (&this->c_writers, &this->mutex);
@@ -96,10 +145,78 @@ gtm_rwlock::write_lock ()
        sum &= ~w_writer;
     }
 
-  // Otherwise we can acquire the lock for write.
+  // Otherwise we can acquire the lock for write. As a writer, we have
+  // priority, so we don't need to take this back.
   this->summary = sum | a_writer;
 
-  pthread_mutex_unlock(&this->mutex);
+  // We still need to wait for active readers to finish. The barrier makes
+  // sure that we first set our write intent and only check for active readers
+  // after that, in strictly this order (similar to the barrier in the fast
+  // path of read_lock()).
+  __sync_synchronize();
+
+  // If this is an upgrade, we are not a reader anymore.
+  if (tx != 0)
+    tx->shared_state = ~(typeof(tx->shared_state))0;
+
+  // Count the number of active readers to be able to decrease the number of
+  // wake-ups and wait calls that are necessary.
+  //
+  // This number is an upper bound of the number of readers that actually
+  // are still active and which we need to wait for:
+  // - We set our write flag before checking the reader flags, and readers
+  //   check our write flag after clearing their read flags in read_unlock().
+  //   Therefore, they will enter the slow path whenever we have seen them.
+  // - Readers will have cleared their read flags before leaving the slow
+  //   path in read_lock() (prevents lost wake-ups), and before waiting for
+  //   any writer (prevents deadlocks).
+  //
+  // However, this number is also just a lower bound of the number of readers
+  // that will actually enter the slow path in read_unlock() or read_lock():
+  // - Because the read flag is cleared outside of a critical section, writers
+  //   can see it as cleared while the reader still goes into the slow path.
+  //
+  // Therefore, readers can skip (lower bound - 1) wake-ups, but we do need
+  // the following loop to check that the readers that we wanted to wait for
+  // are actually those that entered the slow path so far (and either skipped
+  // or sent a wake-up).
+  //
+  // ??? Do we need to optimize further? (The writer could publish a list of
+  // readers that it suspects to be active. Readers could check this list and
+  // only decrement a_readers if they are in this list.)
+  for (;;)
+    {
+      // ??? Keep a list of active readers that we saw and update it on the
+      // next retry instead? This might reduce the number of cache misses that
+      // we get when checking reader flags.
+      int readers = 0;
+      for (gtm_transaction *it = gtm_transaction::list_of_tx; it != 0;
+          it = it->next_tx)
+        {
+          // Don't count ourselves if this is an upgrade.
+          if (it->shared_state != ~(typeof(it->shared_state))0)
+            readers++;
+        }
+
+      // If we have not seen any readers, we will not wait.
+      if (readers == 0)
+        break;
+
+      // We've seen a number of readers, so we publish this number and wait.
+      this->a_readers = readers;
+      pthread_cond_wait (&this->c_confirmed_writers, &this->mutex);
+    }
+
+  pthread_mutex_unlock (&this->mutex);
+  return true;
+}
+
+// Acquire a RW lock for writing.
+
+void
+gtm_rwlock::write_lock ()
+{
+  write_lock_generic (0);
 }
 
 
@@ -108,64 +225,41 @@ gtm_rwlock::write_lock ()
 // if this attempt fails (i.e. another thread also upgraded).
 
 bool
-gtm_rwlock::write_upgrade ()
+gtm_rwlock::write_upgrade (gtm_transaction *tx)
 {
-  pthread_mutex_lock (&this->mutex);
-
-  unsigned int sum = this->summary;
-
-  // If there's already someone trying to upgrade, then we fail.
-  if (unlikely (sum & w_upgrade))
-    {
-      pthread_mutex_unlock (&this->mutex);
-      return false;
-    }
-
-  // If there are more active readers, then we have to wait.
-  if (--this->a_readers > 0)
-    {
-      this->summary = sum | w_upgrade;
-      pthread_cond_wait (&this->c_upgrade, &this->mutex);
-      sum = this->summary & ~w_upgrade;
-      // We only return from upgrade when we've got it; don't loop.
-      assert ((sum & (a_reader | a_writer)) == 0);
-    }
-  else
-    {
-      // We were the only reader.
-      sum &= ~a_reader;
-    }
-
-  // Otherwise we can upgrade to writer.
-  this->summary = sum | a_writer;
-
-  pthread_mutex_unlock (&this->mutex);
-  return true;
+  return write_lock_generic (tx);
 }
 
 
 // Release a RW lock from reading.
 
 void
-gtm_rwlock::read_unlock ()
+gtm_rwlock::read_unlock (gtm_transaction *tx)
 {
+  tx->shared_state = ~(typeof(tx->shared_state))0;
+  __sync_synchronize();
+  unsigned int sum = this->summary;
+  if (likely(!(sum & (a_writer | w_writer))))
+    return;
+
+  // There is a writer, either active or waiting for other readers or writers.
+  // Thus, enter the mutex-based slow path.
   pthread_mutex_lock (&this->mutex);
 
-  // If there are no more active readers, we may need to wake someone.
-  if (--this->a_readers == 0)
+  // If there is a writer waiting for readers, wake it up. Only do that if we
+  // might be the last reader that could do the wake-up, otherwise skip the
+  // wake-up and decrease a_readers to publish that we have entered the slow
+  // path but skipped the wake-up.
+  if (this->a_readers > 0)
     {
-      unsigned int sum = this->summary;
-      this->summary = sum & ~a_reader;
-
-      // If there is a waiting upgrade, wake it.
-      if (unlikely (sum & w_upgrade))
-       pthread_cond_signal (&this->c_upgrade);
-
-      // If there is a waiting writer, wake it.
-      else if (unlikely (sum & w_writer))
-       pthread_cond_signal (&this->c_writers);
+      this->a_readers--;
+      if (this->a_readers == 0)
+        pthread_cond_signal(&this->c_confirmed_writers);
     }
 
+  // We don't need to wake up any writers waiting for other writers. Active
+  // writers will take care of that.
+
   pthread_mutex_unlock (&this->mutex);
 }
 
diff --git a/libitm/config/posix/rwlock.h b/libitm/config/posix/rwlock.h
index 8554445..b7a51ab 100644
--- a/libitm/config/posix/rwlock.h
+++ b/libitm/config/posix/rwlock.h
@@ -29,29 +29,34 @@
 
 namespace GTM HIDDEN {
 
-// This datastructure is similar to the POSIX pthread_rwlock_t except
+struct gtm_transaction;
+
+// This datastructure is the blocking, mutex-based side of the Dekker-style
+// reader-writer lock used to provide mutual exclusion between active and
+// serial transactions. It has similarities to POSIX pthread_rwlock_t except
 // that we also provide for upgrading a reader->writer lock, with a
 // positive indication of failure (another writer acquired the lock
-// before we were able to acquire).
+// before we were able to acquire). While the writer flag (a_writer below) is
+// global and protected by the mutex, there are per-transaction reader flags,
+// which are stored in a transaction's shared state.
+// See libitm's documentation for further details.
 //
-// In this implementation, rw upgrade is given highest priority access,
-// and writers are given priority over readers.
+// In this implementation, writers are given highest priority access but
+// read-to-write upgrades do not have a higher priority than writers.
 
 class gtm_rwlock
 {
-  pthread_mutex_t mutex;       // Held if manipulating any field.
-  pthread_cond_t c_readers;    // Readers wait here
-  pthread_cond_t c_writers;    // Writers wait here
-  pthread_cond_t c_upgrade;    // An upgrader waits here
+  pthread_mutex_t mutex;               // Held if manipulating any field.
+  pthread_cond_t c_readers;            // Readers wait here
+  pthread_cond_t c_writers;            // Writers wait here for writers
+  pthread_cond_t c_confirmed_writers;  // Writers wait here for readers
 
-  static const unsigned w_upgrade  = 1;                // A reader waiting for upgrade
-  static const unsigned a_writer   = 2;                // An active writer.
-  static const unsigned w_writer   = 4;                // The w_writers field != 0
-  static const unsigned a_reader   = 8;                // The a_readers field != 0
-  static const unsigned w_reader   = 16;       // The w_readers field != 0
+  static const unsigned a_writer  = 1; // An active writer.
+  static const unsigned w_writer  = 2; // The w_writers field != 0
+  static const unsigned w_reader  = 4;  // The w_readers field != 0
 
   unsigned int summary;                // Bitmask of the above.
-  unsigned int a_readers;      // Nr active readers
+  unsigned int a_readers;      // Nr active readers as observed by a writer
   unsigned int w_readers;      // Nr waiting readers
   unsigned int w_writers;      // Nr waiting writers
 
@@ -59,13 +64,16 @@ class gtm_rwlock
   gtm_rwlock();
   ~gtm_rwlock();
 
-  void read_lock ();
-  void read_unlock ();
+  void read_lock (gtm_transaction *tx);
+  void read_unlock (gtm_transaction *tx);
 
   void write_lock ();
   void write_unlock ();
 
-  bool write_upgrade ();
+  bool write_upgrade (gtm_transaction *tx);
+
+ protected:
+  bool write_lock_generic (gtm_transaction *tx);
 };
 
 } // namespace GTM
diff --git a/libitm/libitm.texi b/libitm/libitm.texi
index 8d7aef4..5a6582b 100644
--- a/libitm/libitm.texi
+++ b/libitm/libitm.texi
@@ -442,7 +442,262 @@ otherwise (e.g., when a transaction encounters data conflicts during
 optimistic execution).
 
 
-@strong{TODO}: SI-mode implementation, method sets, ...
+@section Locking conventions
+
+This section documents the locking scheme and rules for all uses of locking
+in libitm. We have to support serial(-irrevocable) mode, which is implemented
+using a global lock (called the @emph{serial lock}), as explained next. To
+simplify the overall design, we use the same lock as catch-all locking
+mechanism for other infrequent tasks such as (de)registering clone tables or
+threads. Besides the serial lock, there are @emph{per-method-set locks} that
+are managed by specific method sets (i.e., groups of similar TM concurrency
+control algorithms), and lock-like constructs for quiescence-based operations
+such as ensuring privatization safety.
+
+Thus, the actions that participate in the libitm-internal locking are
+@emph{active transactions} that do not run in serial mode, @emph{serial
+transactions} (which run or are about to run in serial mode), and management tasks
+that do not execute within a transaction but have acquired the serial mode
+like a serial transaction would do (e.g., to be able to register threads with
+libitm). Transactions become active as soon as they have successfully used the
+serial lock to announce this globally (@pxref{serial-lock-impl,,Serial lock
+implementation}). Likewise, transactions become serial transactions as soon as
+they have acquired the exclusive rights provided by the serial lock (i.e.,
+serial mode, which also means that there are no other concurrent active or
+serial transactions). Note that active transactions can become serial
+transactions when they enter serial mode during the runtime of the
+transaction.
+
+@subsection State-to-lock mapping
+
+Application data is protected by the serial lock if there is a serial
+transaction and no concurrently running active transaction (i.e., non-serial).
+Otherwise, application data is protected by the currently selected method set,
+which might use per-method-set locks or other mechanisms. Also note that
+application data that is about to be privatized might not be allowed to be
+accessed by nontransactional code until privatization safety has been ensured;
+the details of this are handled by the current method set.
+
+libitm-internal state is either protected by the serial lock or accessed
+through custom concurrent code. The latter applies to the public/shared part
+of a transaction object and most typical method-set-specific state.
+
+The former category (protected by the serial lock) includes:
+@itemize @bullet
+@item The list of active threads that have used transactions.
+@item The tables that map functions to their transactional clones.
+@item The current selection of which method set to use.
+@item Some method-set-specific data, or invariants of this data. For example,
+resetting a method set to its initial state is handled by switching to the
+same method set, so the serial lock protects such resetting as well.
+@end itemize
+In general, such state is immutable whenever there exists an active
+(non-serial) transaction. If there is no active transaction, a serial
+transaction (or a thread that is not currently executing a transaction but has
+acquired the serial lock) is allowed to modify this state (but must of course
+be careful to not surprise the current method set's implementation with such
+modifications).
+
+@subsection Lock acquisition order
+
+To prevent deadlocks, lock acquisitions must happen in a globally agreed-upon
+order. Note that this applies to other forms of blocking too, but does not
+necessarily apply to lock acquisitions that do not block (e.g., trylock()
+calls that do not get retried forever). Note that serial transactions never
+return back to being active transactions until the transaction has committed.
+Likewise, active transactions stay active until they have committed.
+Per-method-set locks are typically also not released before commit.
+
+Lock acquisition / blocking rules:
+@itemize @bullet
+
+@item Transactions must become active or serial before they are allowed to
+use method-set-specific locks or blocking (i.e., the serial lock must be
+acquired before those other locks, either in serial or nonserial mode).
+
+@item Any number of threads that do not currently run active transactions can
+block while trying to get the serial lock in exclusive mode. Note that active
+transactions must not block when trying to upgrade to serial mode unless there
+is no other transaction that is trying that (the latter is ensured by the
+serial lock implementation).
+
+@item Method sets must prevent deadlocks on their locks. In particular, they
+must also be prepared for another active transaction that has acquired
+method-set-specific locks but is blocked during an attempt to upgrade to
+being a serial transaction. See below for details.
+
+@item Serial transactions can acquire method-set-specific locks because there
+will be no other active nor serial transaction.
+
+@end itemize
+
+There is no single rule for per-method-set blocking because this depends on
+when a TM method might acquire locks. If no active transaction can upgrade to
+being a serial transaction after it has acquired per-method-set locks (e.g.,
+when those locks are only acquired during an attempt to commit), then the TM
+method does not need to consider a potential deadlock due to serial mode.
+
+If there can be upgrades to serial mode after the acquisition of
+per-method-set locks, then TM methods need to avoid those deadlocks:
+@itemize @bullet
+@item When upgrading to a serial transaction, after acquiring exclusive rights
+to the serial lock but before waiting for concurrent active transactions to
+finish (@pxref{serial-lock-impl,,Serial lock implementation} for details),
+we have to wake up all active transactions waiting on the upgrader's
+per-method-set locks.
+@item Active transactions blocking on per-method-set locks need to check the
+serial lock and abort if there is a pending serial transaction.
+@item Lost wake-ups have to be prevented (e.g., by changing a bit in each
+per-method-set lock before doing the wake-up, and only blocking on this lock
+using a futex if this bit is not set).
+@end itemize
+
+@strong{TODO}: Can reuse serial lock for gl-*? And if we can, does it make
+sense to introduce further complexity in the serial lock? For gl-*, we can
+really only avoid an abort if we do -wb and -vbv.
+
+
+@subsection Serial lock implementation
+@anchor{serial-lock-impl}
+
+The serial lock implementation is optimized towards assuming that serial
+transactions are infrequent and not the common case. However, the performance
+of entering serial mode can matter because when only a few transactions run
+concurrently or there are only a few threads, it can be efficient to run
+transactions serially.
+
+The serial lock is similar to a multi-reader-single-writer lock in that there
+can be several active transactions but only one serial transaction. However,
+we do want to avoid contention (in the lock implementation) between active
+transactions, so we split up the reader side of the lock into per-transaction
+flags that are true iff the transaction is active. The exclusive writer side
+remains a shared single flag, which is acquired using a CAS, for example.
+On the fast path, the serial lock then works similarly to Dekker's algorithm
+but with several reader flags that a serial transaction would have to check.
+A serial transaction thus requires a list of all threads with potentially
+active transactions; we can use the serial lock itself to protect this list
+(i.e., only threads that have acquired the serial lock can modify this list).
+
+We want starvation-freedom for the serial lock to allow for using it to ensure
+progress for potentially starved transactions (@pxref{progress-guarantees,,
+Progress Guarantees} for details). However, this is currently not enforced by
+the implementation of the serial lock.
+
+Here is pseudo-code for the read/write fast paths of acquiring the serial
+lock (read-to-write upgrade is similar to write_lock):
+@example
+// read_lock:
+tx->shared_state |= active;
+__sync_synchronize(); // or STLD membar, or C++0x seq-cst fence
+while (serial_lock.exclusive)
+  if (spinning_for_too_long) goto slowpath;
+
+// write_lock:
+if (CAS(&serial_lock.exclusive, 0, this) != 0)
+  goto slowpath; // writer-writer contention
+// need a membar here, but CAS already has full membar semantics
+bool need_blocking = false;
+for (t: all txns)
+  @{
+    for (;t->shared_state & active;)
+      if (spinning_for_too_long) @{ need_blocking = true; break; @}
+  @}
+if (need_blocking) goto slowpath;
+@end example
+
+Releasing a lock in this spin-lock version then just consists of resetting
+@code{tx->shared_state} to inactive or clearing @code{serial_lock.exclusive}.
+
+However, we can't rely on a pure spinlock because we need to get the OS
+involved at some time (e.g., when there are more threads than CPUs to run on).
+Therefore, the real implementation falls back to a blocking slow path, either
+based on pthread mutexes or Linux futexes.
+
+
+@subsection Reentrancy
+
+libitm has to consider the following cases of reentrancy:
+@itemize @bullet
+
+@item Transaction calls unsafe code that starts a new transaction: The outer
+transaction will become a serial transaction before executing unsafe code.
+Therefore, nesting within serial transactions must work, even if the nested
+transaction is called from within uninstrumented code.
+
+@item Transaction calls either a transactional wrapper or safe code, which in
+turn starts a new transaction: It is not yet defined in the specification
+whether this is allowed. Thus, it is undefined whether libitm supports this.
+
+@item Code that starts new transactions might be called from within any part
+of libitm: This kind of reentrancy would likely be rather complex and can
+probably be avoided. Therefore, it is not supported.
+
+@end itemize
+
+@subsection Privatization safety
+
+Privatization safety is ensured by libitm using a quiescence-based approach.
+Basically, a privatizing transaction waits until all concurrent active
+transactions either have finished (are not active anymore) or operate on
+a sufficiently recent snapshot to not access the privatized data anymore. This
+happens after the privatizing transaction has stopped being an active
+transaction, so waiting for quiescence does not contribute to deadlocks.
+
+In method sets that need to ensure privatization safety explicitly, active
+transactions maintain a flag or timestamp in the public/shared part of the
+transaction descriptor. Before blocking, privatizers need to let the other
+transactions know that they should wake up the privatizer.
+
+@strong{TODO} How to implement the waiters? Should those flags be
+per-transaction or at a central place? We want to avoid one wake/wait call
+per active transaction, so we might want to use either a tree or combining
+to reduce the syscall overhead, or rather spin for a long time instead of
+blocking. Also, it would be good if only the last transaction that the
+privatizer waits for would do the wake-up.
+
+@subsection Progress guarantees
+@anchor{progress-guarantees}
+
+Transactions that do not make progress when using the current TM method will
+eventually try to execute in serial mode. Thus, the serial lock's progress
+guarantees determine the progress guarantees of the whole TM. Obviously, we at
+least need deadlock-freedom for the serial lock, but it would also be good to
+provide starvation-freedom (informally, all threads will finish executing a
+transaction eventually iff they get enough cycles).
+
+However, the scheduling of transactions (e.g., thread scheduling by the OS)
+also affects the handling of progress guarantees by the TM. First, the TM
+can only guarantee deadlock-freedom if threads do not get stopped. Likewise,
+low-priority threads can starve if they do not get scheduled when other
+high-priority threads get those cycles instead.
+
+If all threads get scheduled eventually, correct lock implementations will
+provide deadlock-freedom, but might not provide starvation-freedom. We can
+either enforce the latter in the TM's lock implementation, or assume that
+the scheduling is sufficiently random to yield a probabilistic guarantee that
+no thread will starve (because eventually, a transaction will encounter a
+scheduling that will allow it to run). This can indeed work well in practice
+but is not necessarily guaranteed to work (e.g., simple spin locks can be
+pretty efficient).
+
+Because enforcing stronger progress guarantees in the TM has a higher runtime
+overhead, we focus on deadlock-freedom right now and assume that the threads
+will get scheduled eventually by the OS (but don't consider threads with
+different priorities). We should support starvation-freedom for serial
+transactions in the future. Everything beyond that is highly related to proper
+contention management across all of the TM (including which TM method to
+choose), and is future work.
+
+@strong{TODO} Handling thread priorities: We want to avoid priority inversion
+but it's unclear how often that actually matters in practice. Workloads that
+have threads with different priorities will likely also require lower latency
+or higher throughput for high-priority threads. Therefore, it probably does
+not make much sense (except for eventual progress guarantees) to use
+priority inheritance until the TM has priority-aware contention management.
+
+@strong{TODO}: method sets, MS switch protocol and state (re)init after
+switch (e.g., clocks and counters),...
+
 
 @c ---------------------------------------------------------------------
 @c GNU General Public License
diff --git a/libitm/libitm_i.h b/libitm/libitm_i.h
index e2445c7..7d5c345 100644
--- a/libitm/libitm_i.h
+++ b/libitm/libitm_i.h
@@ -199,6 +199,10 @@ struct gtm_transaction
   // Points to the next transaction in the list of all threads' transactions.
   gtm_transaction *next_tx __attribute__((__aligned__(HW_CACHELINE_SIZE)));
 
+  // If this transaction is inactive, shared_state is ~0. Otherwise, this is
+  // an active or serial transaction.
+  gtm_word shared_state;
+
   // The lock that provides access to serial mode.  Non-serialized
   // transactions acquire read locks; a serialized transaction acquires
   // a write lock.
@@ -207,6 +211,9 @@ struct gtm_transaction
   // The head of the list of all threads' transactions.
   static gtm_transaction *list_of_tx;
 
+  gtm_transaction() : shared_state(~(typeof(shared_state))0)
+  {}
+
   // In alloc.cc
   void commit_allocations (bool, aa_tree<uintptr_t, gtm_alloc_action>*);
   void record_allocation (void *, void (*)(void *));
diff --git a/libitm/method-serial.cc b/libitm/method-serial.cc
index 430f004..8f29aba 100644
--- a/libitm/method-serial.cc
+++ b/libitm/method-serial.cc
@@ -241,7 +241,7 @@ GTM::gtm_transaction::serialirr_mode ()
       disp->fini ();
       need_restart = false;
     }
-  else if (serial_lock.write_upgrade ())
+  else if (serial_lock.write_upgrade (this))
     {
       this->state |= STATE_SERIAL;
       if (disp->trycommit ())
diff --git a/libitm/retry.cc b/libitm/retry.cc
index 5b3bf90..6c10d93 100644
--- a/libitm/retry.cc
+++ b/libitm/retry.cc
@@ -52,7 +52,7 @@ GTM::gtm_transaction::decide_retry_strategy (gtm_restart_reason r)
       if ((this->state & STATE_SERIAL) == 0)
        {
          this->state |= STATE_SERIAL;
-         serial_lock.read_unlock ();
+         serial_lock.read_unlock (this);
          serial_lock.write_lock ();
        }
 
