** Description changed:

  SRU Justification
  
  [Impact]
  
  NVIDIA runs into an issue building DGX Cloud GB300 clusters. The
  dst_fetch_ha() checks nud_state without holding the neighbor lock may
  result on the incorrect/zero mac being read and programmed in the device
  QP while it was not yet updated. This causes silent packet loss and
  eventual RETRY_EXC_ERR.
  
  The upstream commit 5e6de34d82b4 (“IB/core: Fix zero dmac race in
  neighbor resolution”) solved this problem.
  
  [Fix]
  
  Questing-aws, Resolute-aws and Noble-aws-6.17:
  
  * cherry-pick 5e6de34d82b4 ("IB/core: Fix zero dmac race in neighbor
  resolution")
  
  [Test Plan]
  
  * Compile and boot test.
  
- * NVIDIA/AWS test.
+ * AWS test.
  
  [Regression potential]
  
- * The commit add a lock in one line. Regression likliehood is slim.
+ * Regression likelihood is slim: the commit adds a lock in one line to
+ hold nud_state until the copy of ha under seqlock.
  
  [Other info]
  
  * SF#00437493

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2154023

Title:
  DGX Cloud GB300 clusters fail to build due zero dmac race in neighbor
  resolution

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/2154023/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to