Re: [PR] [SPARK-56940][SQL] Extend OptimizeRand optimizer to support arithmetic expressions [spark]

via GitHub Thu, 21 May 2026 05:02:04 -0700


yaooqinn commented on code in PR #55982:
URL: https://github.com/apache/spark/pull/55982#discussion_r3280970760



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRandSuite.scala:
##########
@@ -173,4 +176,95 @@ class OptimizeRandSuite extends PlanTest {
     }
   }
 
+  test("Optimize arithmetic expressions with rand") {
+    // rand() * 3 < 3 should be optimized to true
+    val plan1 = testRelation.select((rand5 * literal3d < 
literal3d).as("flag")).analyze
+    val actual1 = Optimize.execute(plan1)
+    val correctAnswer1 = testRelation.select(Alias(TrueLiteral, 
"flag")()).analyze
+    comparePlans(actual1, correctAnswer1)
+
+    // rand() + 1 < 2 should be optimized to true
+    val plan2 = testRelation.select((rand5 + literal1d < 
literal2d).as("flag")).analyze
+    val actual2 = Optimize.execute(plan2)
+    val correctAnswer2 = testRelation.select(Alias(TrueLiteral, 
"flag")()).analyze
+    comparePlans(actual2, correctAnswer2)
+
+    // rand() - 1 < 0 should be optimized to true
+    val plan3 = testRelation.select((rand5 - literal1d < 
literal0d).as("flag")).analyze
+    val actual3 = Optimize.execute(plan3)
+    val correctAnswer3 = testRelation.select(Alias(TrueLiteral, 
"flag")()).analyze
+    comparePlans(actual3, correctAnswer3)
+
+    // rand() / 2 < 1 should be optimized to true
+    val plan4 = testRelation.select((rand5 / literal2d < 
literal1d).as("flag")).analyze
+    val actual4 = Optimize.execute(plan4)
+    val correctAnswer4 = testRelation.select(Alias(TrueLiteral, 
"flag")()).analyze
+    comparePlans(actual4, correctAnswer4)
+
+    // rand() * 2 > 3 should be optimized to false
+    val plan5 = testRelation.select((rand5 * literal2d > 
literal3d).as("flag")).analyze
+    val actual5 = Optimize.execute(plan5)
+    val correctAnswer5 = testRelation.select(Alias(FalseLiteral, 
"flag")()).analyze
+    comparePlans(actual5, correctAnswer5)
+  }
+
+  test("Optimize equality comparison with rand") {
+    // rand() == 0.5 cannot be optimized (value is in [0, 1) range)
+    val plan1 = testRelation.select((rand5 === literalHalf).as("flag")).analyze
+    val actual1 = Optimize.execute(plan1)
+    comparePlans(actual1, plan1)
+
+    // rand() == 2 should be optimized to false (value outside [0, 1) range)
+    val plan2 = testRelation.select((rand5 === literal2d).as("flag")).analyze
+    val actual2 = Optimize.execute(plan2)
+    val correctAnswer2 = testRelation.select(Alias(FalseLiteral, 
"flag")()).analyze
+    comparePlans(actual2, correctAnswer2)
+
+    // rand() == -1 should be optimized to false (value outside [0, 1) range)
+    val plan3 = testRelation.select((rand5 === 
negativeLiteral1d).as("flag")).analyze
+    val actual3 = Optimize.execute(plan3)
+    val correctAnswer3 = testRelation.select(Alias(FalseLiteral, 
"flag")()).analyze
+    comparePlans(actual3, correctAnswer3)
+
+    // 2 == rand() should be optimized to false (literal on left side)
+    val plan4 = testRelation.select((literal2d === rand5).as("flag")).analyze
+    val actual4 = Optimize.execute(plan4)
+    val correctAnswer4 = testRelation.select(Alias(FalseLiteral, 
"flag")()).analyze
+    comparePlans(actual4, correctAnswer4)
+
+    // -1 == rand() should be optimized to false (literal on left side)
+    val plan5 = testRelation.select((negativeLiteral1d === 
rand5).as("flag")).analyze
+    val actual5 = Optimize.execute(plan5)
+    val correctAnswer5 = testRelation.select(Alias(FalseLiteral, 
"flag")()).analyze
+    comparePlans(actual5, correctAnswer5)
+  }
+
+  test("Benchmark: rand optimization performance benefit") {

Review Comment:
   Two issues with this benchmark:
   
   1. Inverted comparison. Both branches re-run `.analyze`; the only delta is 
that the `optimized` branch additionally runs `Optimize.execute`. So 
`optimizedTime > unoptimizedTime` by construction, and `improvement = (un - 
opt) / un * 100` will print a *negative* percentage — the opposite of what the 
PR claims to demonstrate.
   
   2. The PR's claim is that query *execution* is faster (constant-folded 
`Filter(true)` skips per-row `rand()` evaluation in the codegened operator), 
not that *planning* is faster. A planning-time microbench can't observe that.
   
   If you want to demonstrate the perf benefit, please use the standard 
benchmark harness (`SqlBasedBenchmark` under 
`sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/`, e.g. 
`AnalyzerBenchmark.scala`) and measure end-to-end query time on a real 
DataFrame, rather than `System.nanoTime` + `println` in the unit suite.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56940][SQL] Extend OptimizeRand optimizer to support arithmetic expressions [spark]

Reply via email to