Re: [PR] [SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown [spark]

via GitHub Mon, 11 May 2026 11:34:59 -0700


cloud-fan commented on code in PR #54972:
URL: https://github.com/apache/spark/pull/54972#discussion_r3221254647



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala:
##########
@@ -1039,7 +1039,16 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
         execution.FilterExec(f.typedCondition(f.deserializer), 
planLater(f.child)) :: Nil
       case e @ logical.Expand(_, _, child) =>
         execution.ExpandExec(e.projections, e.output, planLater(child)) :: Nil
-      case logical.Sample(lb, ub, withReplacement, seed, child) =>
+      case logical.Sample(lb, ub, withReplacement, seed, child, sampleMethod) 
=>
+        if (sampleMethod == logical.SampleMethod.System) {

Review Comment:
   This branch is unreachable in normal flow — `V2ScanRelationPushDown` is 
non-excludable and always handles SYSTEM samples (either removes the Sample 
node or throws). Reaching it would indicate an internal invariant violation 
(e.g. a custom `Optimizer` subclass that skipped `V2ScanRelationPushDown`), not 
a user error, so a `SparkException.internalError` would be more accurate than a 
user-facing `AnalysisException`.
   
   The comment is also a bit misleading: it says "fell through to row-based 
sampling" but the actual cause if this is ever reached is "the SYSTEM-handling 
optimizer rule didn't run as expected".



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownTableSample.java:
##########
@@ -29,11 +29,28 @@
 public interface SupportsPushDownTableSample extends ScanBuilder {
 
   /**
-   * Pushes down SAMPLE to the data source.
+   * Pushes down BERNOULLI (row-level) SAMPLE to the data source.
    */
   boolean pushTableSample(
       double lowerBound,
       double upperBound,
       boolean withReplacement,
       long seed);
+
+  /**
+   * Pushes down SAMPLE to the data source with the specified sampling method.
+   */
+  default boolean pushTableSample(
+      double lowerBound,
+      double upperBound,
+      boolean withReplacement,
+      long seed,
+      SampleMethod sampleMethod) {
+    if (sampleMethod == SampleMethod.SYSTEM) {
+      // If the data source hasn't overridden this method, it must have not 
added support

Review Comment:
   Word order — "must not have" reads more naturally than "must have not".
   
   ```suggestion
         // If the data source hasn't overridden this method, it must not have 
added support
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##########
@@ -2448,10 +2448,14 @@ class AstBuilder extends DataTypeAstBuilder
    * - TABLESAMPLE(x ROWS): Sample the table down to the given number of rows.
    * - TABLESAMPLE(x PERCENT) [REPEATABLE (y)]: Sample the table down to the 
given percentage with
    * seed 'y'. Note that percentages are defined as a number between 0 and 100.
+   * - TABLESAMPLE SYSTEM(x PERCENT): Sample by data source dependent blocks 
or file splits.

Review Comment:
   Compound modifier should be hyphenated.
   
   ```suggestion
      * - TABLESAMPLE SYSTEM(x PERCENT): Sample by data-source-dependent blocks 
or file splits.
   ```



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##########
@@ -838,6 +844,33 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] 
with PredicateHelper {
   }
 
   def pushDownSample(plan: LogicalPlan): LogicalPlan = plan.transform {
+    case sample: Sample if sample.sampleMethod == SampleMethod.System =>

Review Comment:
   The SYSTEM and BERNOULLI branches share most of the structure — 
`TableSampleInfo` construction, the `PhysicalOperation(_, Nil, sHolder)` 
pattern, and the `PushDownUtils.pushTableSample` call — and differ only in 
failure handling. Worth consolidating into a single `case sample: Sample => 
sample.child match { ... }` that branches on `sampleMethod` only for the throw 
vs. fall-back decision.
   
   As a side effect, the two paths have already started to drift: the seed-TODO 
comment lives only on the SYSTEM branch even though both branches use the same 
`* 1000` multiplier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown [spark]

Reply via email to