[
https://issues.apache.org/jira/browse/SPARK-55978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065557#comment-18065557
]
Serge Rielau commented on SPARK-55978:
--------------------------------------
LGTM, I'm always in favor of coloring out the SQL Standard!
> [Improvement] Block/Split Sampling SQL Augmentation
> ---------------------------------------------------
>
> Key: SPARK-55978
> URL: https://issues.apache.org/jira/browse/SPARK-55978
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Stanley Yao
> Priority: Major
> Fix For: 4.1.1
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> h2. TL;DR
> * Points out a gap in Spark: there is no efficient table-sampling method for
> fast development exploration.
> * Proposes a SQL language augmentation that fills the gap and brings a few
> more benefits.
> * We’d like to gather feedback on the proposal.
> * This doc intentionally omits the technical design because it depends on the
> language syntax. Once we align on the language augmentation, we will present
> the detailed design and implementation.
> h2. Problem Statement
> Users running iterative development workflows (e.g., testing queries, tuning
> joins, validating schemas) routinely scan full tables even when a small
> fraction of the data would suffice. For a 10 TB table with 50,000 splits, a
> developer who only needs to verify output shape or spot-check results is
> forced to read all 50,000 splits. All existing sampling methods read the
> entire input and only discard rows at the sampling step. Users can work
> around this by manually creating a down-sampled table, but that is a
> time-consuming, clunky process and doesn’t support multiple sample rates.
> This proposal fills the gap by offering a method that skips some IO based on
> the table metadata.
> h2. Current Situation
> This is the current Spark sampling clause:
> TABLESAMPLE (
>     { percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total }
> )
> [ REPEATABLE ( seed ) ]
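> For illustration, the existing clause can be used as follows (queries against a hypothetical `events` table):

```sql
-- Row-level random sample: roughly 10% of rows; every split is still read
SELECT * FROM events TABLESAMPLE (10 PERCENT);

-- Same sample, deterministic across runs via a seed
SELECT * FROM events TABLESAMPLE (10 PERCENT) REPEATABLE (123);

-- Exact row count: the first 50 rows produced by the scan
SELECT * FROM events TABLESAMPLE (50 ROWS);

-- Hash-based bucketing: rows whose hash falls in 1 of 4 buckets
SELECT * FROM events TABLESAMPLE (BUCKET 1 OUT OF 4);
```

> Note that all of these read the full input; rows are discarded only after the scan, which is exactly the gap described above.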
> |*Feature*|*Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino*|
> |*Row-level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n PERCENT)|TABLESAMPLE BERNOULLI (n)|
> |*Block/split level*|{color:#de350b}*n/a*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
> |*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
> |*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
> |*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
> h2. Proposed Solution
>
> We are proposing to add one more sampling-method extension, highlighted in
> green below.
> TABLESAMPLE {color:#00875a}*[ SYSTEM | BERNOULLI | <other_extensions> ]*{color} (
>     { percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total }
> )
> [ REPEATABLE ( seed ) ]
> |*Feature*|*Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino*|
> |*Row-level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n PERCENT)|TABLESAMPLE BERNOULLI (n)|
> |*Block/split level*|{color:#00875a}*TABLESAMPLE SYSTEM (n PERCENT)*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
> |*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
> |*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
> |*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
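> Under the proposal, a split-level sample would be expressed as below (proposed syntax, not yet implemented; `events` is a hypothetical table):

```sql
-- Select roughly 1% of splits (not rows); unselected splits are never scanned
SELECT * FROM events TABLESAMPLE SYSTEM (1 PERCENT);

-- The existing row-level behavior, spelled out explicitly
SELECT * FROM events TABLESAMPLE BERNOULLI (10 PERCENT);
```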
>
> Meaning of “{*}SYSTEM{*}”:
> * {*}ANSI definition{*}: “{_}The method of selection is
> implementation-dependent. Each implementation shall describe the method of
> selection it uses. The selection is done based on some implementation-defined
> unit of physical storage (e.g., a page, block, or disk extent), such that
> each unit has approximately a <sample percentage> / 100 probability of being
> selected.{_}” (ISO/IEC 9075-2, SQL/Foundation, Section 7.6)
> * {*}Spark implementation{*}: In general, the unit maps to an RDD partition.
> Different data sources (e.g., Iceberg, Delta Lake, Hudi) may have different
> natural “blocks”.
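> The selection idea behind SYSTEM can be sketched in a few lines of Python. This is a minimal illustration of per-split Bernoulli selection from metadata alone, not Spark's actual implementation; the function name and parameters are hypothetical.

```python
import random


def sample_splits(num_splits, percent, seed=None):
    """Select each split independently with probability percent/100.

    Sketch of SYSTEM-style block sampling: the keep/drop decision is made
    per split from metadata alone, so unselected splits are never read.
    """
    rng = random.Random(seed)  # analogue of REPEATABLE (seed)
    p = percent / 100.0
    return [i for i in range(num_splits) if rng.random() < p]


# With 50,000 splits and a 1% sample rate, only ~500 splits are scanned
# instead of all 50,000.
selected = sample_splits(50_000, 1, seed=42)
```

> Because the probability applies to storage units rather than rows, the fraction of *rows* returned only approximates the sample percentage, which the ANSI text above ("approximately") allows.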
> h2. Benefit for Spark
> * Faster exploratory queries, which is critical for large-scale jobs, e.g.,
> ML and GenAI.
> * One more way to achieve query cost savings (i.e., IO reduction).
> * Enables data-source-level sampling. DSv2 supports sampling pushdown (the
> *SupportsPushDownTableSample* interface), so the Iceberg, Delta, and Hudi
> connectors can benefit from native sampling.
> * ANSI SQL compliance.
> h2. Limitations
> * Block/split-level sampling is not fair/unbiased in the statistical sense.
> This is by design: because this method does not read all the data, it
> prioritizes IO savings over a strictly unbiased sampling distribution.
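> This limitation can be made concrete with a small Python sketch (hypothetical data and function names). When row values correlate with split layout, e.g., a date-partitioned table where each split holds one day, a block-level sample keeps or drops whole value ranges, so an aggregate over the sample can drift from the true value in a way row-level sampling would not.

```python
import random


def block_sample_mean(splits, percent, seed):
    """Mean over a SYSTEM-style sample: whole splits are kept or dropped."""
    rng = random.Random(seed)
    kept = [row
            for split in splits if rng.random() < percent / 100.0
            for row in split]
    return sum(kept) / len(kept) if kept else None


# Hypothetical date-partitioned table: 100 splits, each split holding
# 1,000 copies of its day number, so splits are homogeneous rather than
# randomly shuffled.
splits = [[day] * 1000 for day in range(100)]
true_mean = 49.5  # mean of days 0..99

# A 10% block sample keeps ~10 whole splits; its mean can drift from
# 49.5 because entire value ranges are included or excluded together.
est = block_sample_mean(splits, 10, seed=7)
```

> A row-level (BERNOULLI) sample would touch every split and stay close to the true mean; that statistical fairness is exactly what SYSTEM trades away for IO savings.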
--
This message was sent by Atlassian Jira
(v8.20.10#820010)