Stanley Yao created SPARK-55978:
-----------------------------------
Summary: Block/Split Sampling SQL Augmentation
Key: SPARK-55978
URL: https://issues.apache.org/jira/browse/SPARK-55978
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.1.1
Reporter: Stanley Yao
Fix For: 4.1.1
h2. TL;DR
* Points out a gap for fast development exploration via efficient table
sampling in Spark.
* Propose a SQL language augmentation that fills the gap and brings a few more
benefits.
* We’d like to gather feedback about the proposal.
* This doc intentionally omitted technical design due to its dependency on
language syntax. After we align on the language augmentation, we will present
the detailed design and implementation.
h2. Problem Statement
Users running iterative development workflows (e.g., testing queries, tuning
joins, validating schemas) routinely scan full tables even when a small
fraction of the data would suffice. For a 10 TB table with 50,000 splits, a
developer who only needs to verify output shape or spot-check results is forced
to read all 50,000 splits. All the existing sampling methods read the entire
input data and discard at the sampling step. Users could workaround it by
manually creating a down-sampled table, but it's a time-consuming and clunky
process and doesn’t support multiple sample rates.This proposal fills the gap
by offering a method to skip some IO based on the table metadata.
h2. Current Situation
This is the current Spark sampling clause:
TABLESAMPLE ( {
percentage PERCENT
| num_rows ROWS
| BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]
|*Feature*| *Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino* |
|*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n
PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|{color:#de350b}*n/a*{color}|TABLESAMPLE SYSTEM (n
PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
h2. Proposed Solution
We are proposing to add one more sampling method extension highlighted green
below.
TABLESAMPLE *[ SYSTEM | BERNOULLI | <other_extensions> ]* ( {
percentage PERCENT
| num_rows ROWS
| BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]
|*Feature*| *Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino* |
|*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n
PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|{color:#00875a}*TABLESAMPLE SYSTEM (n
PERCENT)*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
Meaning of “{*}SYSTEM{*}”:
* {*}ANSI definition{*}: “{_}The method of selection is
implementation-dependent. Each implementation shall describe the method of
selection it uses. The selection is done based on some implementation-defined
unit of physical storage (e.g., a page, block, or disk extent), such that each
unit has approximately a <sample percentage> / 100 probability of being
selected. -ISO/IEC 9075-2 (SQL/Foundation), Section 7.6{_}”
* {*}Spark implementation{*}: In general, it maps to “RDD partition”.
Different data sources (e.g., Iceberg, DeltaLake, Hudi) may have different
natural “blocks”.
h2. Benefit for Spark
* Faster exploratory queries, which is critical for large scale jobs, e.g.,
ML, GenAI.
* One more way to achieve query cost savings (i.e., IO reduction).
* Enabled data source level sampling. DSv2 enables sampling push down (i.e.,
*SupportsPushDownTableSample* interface) to benefit from native sampling via
Iceberg, Delta, and Hudi connectors.
* ANSI SQL compliance.
h2. Limitations
* Block/Split level sampling won’t be fair/unbiased based on statistical
sampling. This is by design, because this sampling method doesn’t read all the
data. It prioritizes IO savings over strict and unbiased statistical sampling
distribution.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]