[jira] [Created] (SPARK-55978) Block/Split Sampling SQL Augmentation

Stanley Yao (Jira) Thu, 12 Mar 2026 11:28:23 -0700

Stanley Yao created SPARK-55978:
-----------------------------------

             Summary: Block/Split Sampling SQL Augmentation
                 Key: SPARK-55978
                 URL: https://issues.apache.org/jira/browse/SPARK-55978
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.1.1
            Reporter: Stanley Yao
             Fix For: 4.1.1



h2. TL;DR
 * Points out a gap for fast development exploration via efficient table 
sampling in Spark.
 * Propose a SQL language augmentation that fills the gap and brings a few more 
benefits.
 * We’d like to gather feedback about the proposal.
 * This doc intentionally omitted technical design due to its dependency on 
language syntax. After we align on the language augmentation, we will present 
the detailed design and implementation.  

h2. Problem Statement

Users running iterative development workflows (e.g., testing queries, tuning 
joins, validating schemas) routinely scan full tables even when a small 
fraction of the data would suffice. For a 10 TB table with 50,000 splits, a 
developer who only needs to verify output shape or spot-check results is forced 
to read all 50,000 splits. All the existing sampling methods read the entire 
input data and discard at the sampling step. Users could workaround it by 
manually creating a down-sampled table, but it's a time-consuming and clunky 
process and doesn’t support multiple sample rates.This proposal fills the gap 
by offering a method to skip some IO based on the table metadata.
h2. Current Situation

This is the current Spark sampling clause:

TABLESAMPLE ( {

    percentage PERCENT

    | num_rows ROWS

    | BUCKET fraction OUT OF total } )

[ REPEATABLE ( seed ) ]

 
|*Feature*| *Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino* |
|*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n 
PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|{color:#de350b}*n/a*{color}|TABLESAMPLE SYSTEM (n 
PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|

 
h2. Proposed Solution

We are proposing to add one more sampling method extension highlighted green 
below.

TABLESAMPLE *[ SYSTEM | BERNOULLI | <other_extensions> ]* ( {

    percentage PERCENT

    | num_rows ROWS

    | BUCKET fraction OUT OF total } )

[ REPEATABLE ( seed ) ]




|*Feature*| *Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino* |
|*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n 
PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|{color:#00875a}*TABLESAMPLE SYSTEM (n 
PERCENT)*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|

 

Meaning of “{*}SYSTEM{*}”:
 * {*}ANSI definition{*}: “{_}The method of selection is 
implementation-dependent. Each implementation shall describe the method of 
selection it uses. The selection is done based on some implementation-defined 
unit of physical storage (e.g., a page, block, or disk extent), such that each 
unit has approximately a <sample percentage> / 100 probability of being 
selected. -ISO/IEC 9075-2 (SQL/Foundation), Section 7.6{_}”
 * {*}Spark implementation{*}: In general, it maps to “RDD partition”. 
Different data sources (e.g., Iceberg, DeltaLake, Hudi) may have different 
natural “blocks”.

h2. Benefit for Spark
 * Faster exploratory queries, which is critical for large scale jobs, e.g., 
ML, GenAI.
 * One more way to achieve query cost savings (i.e., IO reduction).
 * Enabled data source level sampling. DSv2 enables sampling push down (i.e., 
*SupportsPushDownTableSample* interface) to benefit from native sampling via 
Iceberg, Delta, and Hudi connectors.
 * ANSI SQL compliance.

h2. Limitations
 * Block/Split level sampling won’t be fair/unbiased based on statistical 
sampling. This is by design, because this sampling method doesn’t read all the 
data. It prioritizes IO savings over strict and unbiased statistical sampling 
distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (SPARK-55978) Block/Split Sampling SQL Augmentation

Reply via email to