cloud-fan commented on code in PR #55711:
URL: https://github.com/apache/spark/pull/55711#discussion_r3273233880


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:
##########
@@ -2663,6 +2664,57 @@ abstract class MergeIntoTableSuiteBase extends 
RowLevelOperationSuiteBase
     }
   }
 
+  test("metric values are stable across stage retries") {
+    // The join in the MERGE plan introduces a shuffle (with broadcast 
disabled). The
+    // DAGScheduler corrupts the first attempt of every upstream shuffle map 
stage, forcing
+    // the MergeRowsExec stage to retry. With plain SQLMetrics the row 
counters would double
+    // up across attempts, but SQLLastAttemptMetric reports only the last 
attempt, so the
+    // values surfaced via `MergeSummary` remain correct.

Review Comment:
   Per your own TODO in the conversation thread, the current fetch-failure 
injection doesn't actually retry the `MergeRowsExec`/writer stage — only 
upstream shuffle map stages retry. With plain `SQLMetric`, the `numTargetRows*` 
counters wouldn't double up under this injection, so this test passes equally 
well without the SLAM swap. Suggested rewording:
   
   ```suggestion
       // The join in the MERGE plan introduces a shuffle (with broadcast 
disabled), and the
       // DAGScheduler corrupts the first attempt of every upstream shuffle map 
stage. Note:
       // the current fetch-failure injection does not retry the 
MergeRowsExec/writer stage,
       // so this test passes equally well with plain SQLMetric — it only 
exercises the
       // SLAM-aware read path. Follow-up #55738 will add infra to actually 
retry the writer
       // stage and exercise the SLAM behavior end-to-end for MERGE.
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/connector/UpdateTableSuiteBase.scala:
##########
@@ -340,6 +341,46 @@ abstract class UpdateTableSuiteBase extends 
RowLevelOperationSuiteBase {
     checkUpdateMetrics(numUpdatedRows = 2, numCopiedRows = 1)
   }
 
+  test("metric values are stable across stage retries") {
+    // Force a shuffle in the UPDATE plan via an IN-subquery (with broadcast 
disabled), then
+    // have the DAGScheduler corrupt the first attempt of every upstream 
shuffle map stage so
+    // the writer stage has to retry. With a plain SQLMetric the row counters 
would double up
+    // across attempts, but SQLLastAttemptMetric reports only the last 
attempt, so the values
+    // surfaced via `UpdateSummary` remain correct.

Review Comment:
   Same point as on the MERGE test: per the TODO in the conversation thread, 
the writer stage doesn't actually retry under the current fetch-failure 
injection, so `numUpdatedRows` / `numCopiedRows` wouldn't double up with plain 
`SQLMetric` either. Suggested rewording:
   
   ```suggestion
       // Force a shuffle in the UPDATE plan via an IN-subquery (with broadcast 
disabled), then
       // have the DAGScheduler corrupt the first attempt of every upstream 
shuffle map stage.
       // Note: the current fetch-failure injection does not retry the writer 
stage, so this
       // test passes equally well with plain SQLMetric — it only exercises the 
SLAM-aware
       // read path. Follow-up #55738 will add infra to actually retry the 
writer stage and
       // exercise the SLAM behavior end-to-end for UPDATE.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to