Matt Sinclair has submitted this change. ( https://gem5-review.googlesource.com/c/public/gem5/+/51367 )

Change subject: mem-ruby: Update GPU VIPER TCC protocol to resolve deadlock
......................................................................

mem-ruby: Update GPU VIPER TCC protocol to resolve deadlock

In the GPU VIPER TCC, programs that mix atomic and data accesses to
the same address within the same kernel can deadlock when large
applications (e.g., Pannotia's graph analytics algorithms) run on
very small GPUs (e.g., the default 4 CU GPU configuration).  In this
situation, deadlock occurs when resource stalls interact with the
current implementation's handling of races between atomic accesses.
The specific sequence of events causing the deadlock is:

1. TCC is waiting on an atomic to return from directory

2. In the meantime, the TCC receives another atomic to the same
address.  When this happens, it increments the number of atomics
pending to this address in the TBE (numAtomics = 2) and writes the
second atomic through to the directory.

3. When the first atomic returns from the directory, the TCC
decrements the numAtomics counter.  Because numAtomics was 2 (due to
step #2), the TCC does not deallocate the TBE entry and instead
triggers Event:AtomicNotDone.

4. Another request (an LD) to the same address arrives.  The LD does
z_stall because the second atomic is pending, so the LD retries every
cycle until the deadlock counter times out (or until the second
atomic comes back).

5.  The second atomic returns to the TCC.  However, the many LDs
pending in the cache, all doing z_stalls and retrying every cycle,
cause heavy resource stalls.  The returning atomic is therefore
forced to retry its operation multiple times, and each retry
decrements the atomicDoneCnt counter (which was added in 7246f70bfb
to catch a race between atomics arriving at and leaving the TCC).  As
a result, atomicDoneCnt becomes negative.

6.  Since atomicDoneCnt determines when Event:AtomicDone fires, and
the resource stalls drove it negative, the atomic never completes.
This means the pending LDs can never access the line, because they
are stuck waiting for the atomic to complete (the sketch after step 7
illustrates this counter underflow).

7.  Eventually the deadlock threshold is reached.
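
As an aside, here is a minimal standalone sketch (plain C++, not the
SLICC protocol code) of the counter underflow described in steps 5
and 6.  The names numAtomics and atomicDoneCnt mirror the protocol
state above; the retry loop stands in for the resource-stall-driven
retries:

#include <iostream>

int main()
{
    int numAtomics = 2;    // two atomics outstanding to one address (steps 1-2)
    int atomicDoneCnt = 0; // meant to track atomic completions

    --numAtomics;          // first atomic returns; TBE stays allocated (step 3)

    // Second atomic returns, but resource stalls force it to retry.
    // With z_stall, every retry re-runs the completion path and
    // decrements the counter again, instead of exactly once (step 5).
    for (int retry = 0; retry < 3; ++retry)
        --atomicDoneCnt;

    // Event:AtomicDone only fires when atomicDoneCnt returns to its
    // expected value, so a negative counter means it never fires (step 6).
    std::cout << "atomicDoneCnt = " << atomicDoneCnt << "\n";
    return 0;
}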

To fix this issue, this commit changes the VIPER TCC protocol from
using z_stall to using the stall_and_wait buffer mechanism that the
directory level of the SLICC protocol already uses.  This change
prevents resource stalls from dominating the TCC by parking pending
requests for a given address in a per-address stall buffer.  These
requests are then woken up when the outstanding request returns.
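
Conceptually, stall_and_wait plus a per-address wakeup behaves like a
parking lot keyed by address.  The following C++ sketch models that
idea only; the class, method names, and Request type are illustrative
assumptions, not gem5's AbstractController implementation:

#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_map>

using Addr = uint64_t;
struct Request { Addr addr; const char *op; };

class StallBufferModel
{
    // One queue of parked requests per address.
    std::unordered_map<Addr, std::deque<Request>> waiting;

  public:
    // stall_and_wait: park the request; unlike z_stall, it no longer
    // retries (and consumes pipeline resources) every cycle.
    void stallAndWait(const Request &req) { waiting[req.addr].push_back(req); }

    // wakeUpAllBuffers(addr): when the outstanding response returns,
    // replay every request parked on that address.
    void wakeUpAll(Addr addr)
    {
        auto it = waiting.find(addr);
        if (it == waiting.end())
            return;
        for (const auto &req : it->second)
            std::cout << "replaying " << req.op << " @ 0x"
                      << std::hex << req.addr << std::dec << "\n";
        waiting.erase(it);
    }
};

int main()
{
    StallBufferModel buf;
    buf.stallAndWait({0x40, "LD"});     // instead of z_stall-ing every cycle
    buf.stallAndWait({0x40, "Atomic"});
    buf.wakeUpAll(0x40);                // pending atomic returned; wake both
    return 0;
}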

As part of this change, this commit also makes two small changes to
the directory-level protocol (MOESI_AMD_BASE-dir):

1.  Updated the names of the wakeup actions to match the TCC wakeup actions,
to avoid confusion.

2. Changed transition(B, UnblockWriteThrough, U) to check all stall
buffers, as some requests were being parked in stall buffers that the
narrower wakeup did not check.  This mirrors the changes made in
187c44fe44 to other directory transitions to resolve races between
GPU and DMA requests, but applies them to transitions that prior
workloads did not stress (see the sketch after this list).
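
The distinction behind change #2 can be sketched as follows.  This is
a simplified, illustrative C++ model, not the actual
AbstractController code: the rank-based scan mirrors how gem5's
controllers keep one stall buffer per in-port, with wakeUpBuffers
scanning only a subset of them while wakeUpAllBuffers scans every one:

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using Addr = uint64_t;

// One stall buffer per in-port rank, holding parked request addresses.
std::map<int, std::vector<Addr>> stallBuffers;

static void wake(int rank, Addr addr)
{
    auto &buf = stallBuffers[rank];
    for (auto it = buf.begin(); it != buf.end();) {
        if (*it == addr) {
            std::cout << "woke request at in-port rank " << rank << "\n";
            it = buf.erase(it);
        } else {
            ++it;
        }
    }
}

// Wakes only the in-ports ranked below the one currently executing;
// a request parked by a later-ranked port stays asleep.
void wakeUpBuffers(Addr addr, int curRank)
{
    for (int rank = curRank - 1; rank >= 0; --rank)
        wake(rank, addr);
}

// Wakes every in-port's stall buffer, so no parked request is missed.
void wakeUpAllBuffers(Addr addr, int numPorts)
{
    for (int rank = numPorts - 1; rank >= 0; --rank)
        wake(rank, addr);
}

int main()
{
    stallBuffers[3].push_back(0x40);        // parked by a high-ranked in-port
    wakeUpBuffers(0x40, /*curRank=*/2);     // misses it: only ranks 0-1 scanned
    wakeUpAllBuffers(0x40, /*numPorts=*/4); // finds and wakes it
    return 0;
}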

Change-Id: I60ac9830a87c125e9ac49515a7fc7731a65723c2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51367
Reviewed-by: Jason Lowe-Power <power...@gmail.com>
Reviewed-by: Matthew Poremba <matthew.pore...@amd.com>
Maintainer: Jason Lowe-Power <power...@gmail.com>
Tested-by: kokoro <noreply+kok...@google.com>
---
M src/mem/ruby/protocol/MOESI_AMD_Base-dir.sm
M src/mem/ruby/protocol/GPU_VIPER-TCC.sm
2 files changed, 109 insertions(+), 15 deletions(-)

Approvals:
  Jason Lowe-Power: Looks good to me, but someone else must approve; Looks good to me, approved
  Matthew Poremba: Looks good to me, approved
  kokoro: Regressions pass




diff --git a/src/mem/ruby/protocol/GPU_VIPER-TCC.sm b/src/mem/ruby/protocol/GPU_VIPER-TCC.sm
index 6c07416..6112f38 100644
--- a/src/mem/ruby/protocol/GPU_VIPER-TCC.sm
+++ b/src/mem/ruby/protocol/GPU_VIPER-TCC.sm
@@ -126,6 +126,7 @@
   void unset_tbe();
   void wakeUpAllBuffers();
   void wakeUpBuffers(Addr a);
+  void wakeUpAllBuffers(Addr a);

   MachineID mapAddressToMachine(Addr addr, MachineType mtype);

@@ -569,6 +570,14 @@
     probeNetwork_in.dequeue(clockEdge());
   }

+  action(st_stallAndWaitRequest, "st", desc="Stall and wait on the address") {
+    stall_and_wait(coreRequestNetwork_in, address);
+  }
+
+  action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
+    wakeUpAllBuffers(address);
+  }
+
   action(z_stall, "z", desc="stall") {
       // built-in
   }
@@ -606,13 +615,22 @@
   // they can cause a resource stall deadlock!

transition(WI, {RdBlk, WrVicBlk, Atomic, WrVicBlkBack}) { //TagArrayRead} {
-      z_stall;
+      // by putting the stalled requests in a buffer, we reduce resource contention
+      // since they won't try again every cycle and will instead only try again once
+      // woken up
+      st_stallAndWaitRequest;
   }
   transition(A, {RdBlk, WrVicBlk, WrVicBlkBack}) { //TagArrayRead} {
-      z_stall;
+      // by putting the stalled requests in a buffer, we reduce resource contention
+      // since they won't try again every cycle and will instead only try again once
+      // woken up
+      st_stallAndWaitRequest;
   }
   transition(IV, {WrVicBlk, Atomic, WrVicBlkBack}) { //TagArrayRead} {
-      z_stall;
+      // by putting the stalled requests in a buffer, we reduce resource contention
+      // since they won't try again every cycle and will instead only try again once
+      // woken up
+      st_stallAndWaitRequest;
   }
   transition({M, V}, RdBlk) {TagArrayRead, DataArrayRead} {
     p_profileHit;
@@ -660,9 +678,10 @@

   transition(A, Atomic) {
     p_profileMiss;
-    at_atomicThrough;
-    ina_incrementNumAtomics;
-    p_popRequestQueue;
+    // by putting the stalled requests in a buffer, we reduce resource contention
+    // since they won't try again every cycle and will instead only try again once
+    // woken up
+    st_stallAndWaitRequest;
   }

   transition({M, W}, Atomic, WI) {TagArrayRead} {
@@ -750,6 +769,7 @@
     wcb_writeCacheBlock;
     sdr_sendDataResponse;
     pr_popResponseQueue;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
   }

@@ -762,6 +782,7 @@

   transition(A, AtomicDone, I) {TagArrayRead, TagArrayWrite} {
     dt_deallocateTBE;
+    wada_wakeUpAllDependentsAddr;
     ptr_popTriggerQueue;
   }

diff --git a/src/mem/ruby/protocol/MOESI_AMD_Base-dir.sm b/src/mem/ruby/protocol/MOESI_AMD_Base-dir.sm
index 4d24891..0138db3 100644
--- a/src/mem/ruby/protocol/MOESI_AMD_Base-dir.sm
+++ b/src/mem/ruby/protocol/MOESI_AMD_Base-dir.sm
@@ -1092,15 +1092,15 @@
     stall_and_wait(dmaRequestQueue_in, address);
   }

-  action(wa_wakeUpDependents, "wa", desc="Wake up any requests waiting for this address") {
+  action(wad_wakeUpDependents, "wad", desc="Wake up any requests waiting for this address") {
     wakeUpBuffers(address);
   }

-  action(wa_wakeUpAllDependents, "waa", desc="Wake up any requests waiting for this region") {
+  action(wa_wakeUpAllDependents, "wa", desc="Wake up any requests waiting for this region") {
     wakeUpAllBuffers();
   }

-  action(wa_wakeUpAllDependentsAddr, "waaa", desc="Wake up any requests waiting for this address") {
+  action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
     wakeUpAllBuffers(address);
   }

@@ -1206,7 +1206,7 @@
     d_writeDataToMemory;
     al_allocateL3Block;
    pr_profileL3HitMiss; //Must come after al_allocateL3Block and before dt_deallocateTBE
-    wa_wakeUpDependents;
+    wad_wakeUpDependents;
     dt_deallocateTBE;
     pr_popResponseQueue;
   }
@@ -1232,12 +1232,12 @@
   }

   transition({B}, CoreUnblock, U) {
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     pu_popUnblockQueue;
   }

   transition(B, UnblockWriteThrough, U) {
-    wa_wakeUpDependents;
+    wada_wakeUpAllDependentsAddr;
     pt_popTriggerQueue;
   }

@@ -1280,7 +1280,7 @@
   transition(BDR_M, MemData, U) {
     mt_writeMemDataToTBE;
     dd_sendResponseDmaData;
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
     pm_popMemQueue;
   }
@@ -1365,7 +1365,7 @@
   transition(BDW_P, ProbeAcksComplete, U) {
    // Check for pending requests from the core we put to sleep while waiting
     // for a response
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
     pt_popTriggerQueue;
   }
@@ -1374,7 +1374,7 @@
     dd_sendResponseDmaData;
    // Check for pending requests from the core we put to sleep while waiting
     // for a response
-    wa_wakeUpAllDependentsAddr;
+    wada_wakeUpAllDependentsAddr;
     dt_deallocateTBE;
     pt_popTriggerQueue;
   }

--
To view, visit https://gem5-review.googlesource.com/c/public/gem5/+/51367
To unsubscribe, or for help writing mail filters, visit https://gem5-review.googlesource.com/settings

Gerrit-Project: public/gem5
Gerrit-Branch: develop
Gerrit-Change-Id: I60ac9830a87c125e9ac49515a7fc7731a65723c2
Gerrit-Change-Number: 51367
Gerrit-PatchSet: 2
Gerrit-Owner: Matt Sinclair <mattdsincl...@gmail.com>
Gerrit-Reviewer: Jason Lowe-Power <ja...@lowepower.com>
Gerrit-Reviewer: Jason Lowe-Power <power...@gmail.com>
Gerrit-Reviewer: Matt Sinclair <mattdsincl...@gmail.com>
Gerrit-Reviewer: Matthew Poremba <matthew.pore...@amd.com>
Gerrit-Reviewer: kokoro <noreply+kok...@google.com>
Gerrit-CC: Kyle Roarty <kyleroarty1...@gmail.com>
Gerrit-MessageType: merged