[ 
https://issues.apache.org/jira/browse/SPARK-55978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Yao updated SPARK-55978:
--------------------------------
    Description: 
h2. TL;DR
 * Points out a gap in Spark for fast development exploration via efficient table sampling.
 * Proposes a SQL language augmentation that fills the gap and brings a few more benefits.
 * We’d like to gather feedback on the proposal.
 * This doc intentionally omits the technical design because it depends on the language syntax. Once we align on the language augmentation, we will present the detailed design and implementation.

h2. Problem Statement

Users running iterative development workflows (e.g., testing queries, tuning joins, validating schemas) routinely scan full tables even when a small fraction of the data would suffice. For a 10 TB table with 50,000 splits, a developer who only needs to verify the output shape or spot-check results is forced to read all 50,000 splits. All existing sampling methods read the entire input and only discard rows at the sampling step. Users can work around this by manually creating a down-sampled table, but that is a time-consuming, clunky process and does not support multiple sample rates. This proposal fills the gap by offering a method to skip IO based on table metadata, as illustrated below.
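
For illustration, here is the kind of workaround developers use today versus what the proposal would allow. The table and column names below are hypothetical, and the SYSTEM query is not valid in Spark yet:
{code:scala}
// Today's workaround (hypothetical table names): materialize a down-sampled copy.
// This still scans the full 10 TB table once, and is fixed to a single sample rate.
spark.sql("CREATE TABLE events_sample_1pct AS SELECT * FROM events TABLESAMPLE (1 PERCENT)")

// With this proposal (not yet supported): skip most input splits at scan time,
// based on table metadata, with no materialized copy and any sample rate per query.
spark.sql("SELECT * FROM events TABLESAMPLE SYSTEM (1 PERCENT)").show()
{code}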
h2. Current Situation

This is the current Spark sampling clause:

TABLESAMPLE ( { percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]
|*Feature*|*Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino*|
|*Row-level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|{color:#de350b}*n/a*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
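
For reference, a few example queries using the existing clause (the table name is made up). Note that all of them read every input split and only discard rows after the scan:
{code:scala}
// Existing Spark TABLESAMPLE variants; sampling happens after all splits are read.
spark.sql("SELECT * FROM events TABLESAMPLE (10 PERCENT)")                  // row-level random
spark.sql("SELECT * FROM events TABLESAMPLE (100 ROWS)")                    // exact row count
spark.sql("SELECT * FROM events TABLESAMPLE (BUCKET 1 OUT OF 10)")          // hash-based bucketing
spark.sql("SELECT * FROM events TABLESAMPLE (10 PERCENT) REPEATABLE (42)")  // reproducible sample
{code}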
h2. Proposed Solution

We propose adding one more sampling method extension, highlighted in green below.

TABLESAMPLE {color:#00875a}*[ SYSTEM | BERNOULLI | <other_extensions> ]*{color} ( { percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]
|*Feature*|*Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino*|
|*Row-level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|{color:#00875a}*TABLESAMPLE SYSTEM (n PERCENT)*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
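
Hypothetical usage of the proposed syntax (not implemented yet; the table name is made up):
{code:scala}
// SYSTEM: block/split-level sampling; only roughly 1% of the splits would be read.
spark.sql("SELECT * FROM events TABLESAMPLE SYSTEM (1 PERCENT) REPEATABLE (42)").show()

// BERNOULLI: an explicit, ANSI/Trino-compatible spelling of today's row-level sampling.
spark.sql("SELECT count(*) FROM events TABLESAMPLE BERNOULLI (10 PERCENT)").show()
{code}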

 

Meaning of “{*}SYSTEM{*}”:
 * {*}ANSI definition{*}: “{_}The method of selection is implementation-dependent. Each implementation shall describe the method of selection it uses. The selection is done based on some implementation-defined unit of physical storage (e.g., a page, block, or disk extent), such that each unit has approximately a <sample percentage> / 100 probability of being selected.{_}” (ISO/IEC 9075-2 (SQL/Foundation), Section 7.6)
 * {*}Spark implementation{*}: In general, the sampling unit maps to an RDD partition (i.e., an input split). Different data sources (e.g., Iceberg, Delta Lake, Hudi) may have different natural “blocks”. See the illustrative sketch below.
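
The sketch below only illustrates what partition-level selection means; it is not the proposed design (which is intentionally deferred), and the helper name and seeding scheme are assumptions:
{code:scala}
// Illustrative only, not the proposed design: keep roughly percentage% of partitions,
// chosen deterministically from a seed, and return no rows for the rest.
// Note: a post-hoc RDD transformation like this still schedules a task per partition;
// the point of the proposal is to prune splits before they are ever scanned.
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

def samplePartitions[T: ClassTag](rdd: RDD[T], percentage: Double, seed: Long): RDD[T] = {
  val rng = new Random(seed)
  val keep = (0 until rdd.getNumPartitions)
    .filter(_ => rng.nextDouble() < percentage / 100.0)
    .toSet
  rdd.mapPartitionsWithIndex { (idx, rows) =>
    if (keep.contains(idx)) rows else Iterator.empty
  }
}
{code}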

h2. Benefit for Spark
 * Faster exploratory queries, which is critical for large-scale jobs (e.g., ML, GenAI).
 * One more way to achieve query cost savings (i.e., IO reduction).
 * Enables data-source-level sampling. DSv2 already supports sample push-down (the *SupportsPushDownTableSample* interface), so Iceberg, Delta Lake, and Hudi connectors can benefit from native sampling; see the sketch after this list.
 * ANSI SQL compliance.
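
As a rough sketch of how a connector could benefit, the existing DSv2 *SupportsPushDownTableSample* interface already lets a ScanBuilder accept a pushed sample. How the proposed SYSTEM keyword would flow into it is an open design question, and the class below is hypothetical:
{code:scala}
// Hypothetical connector-side sketch: accept the pushed sample and honor it by
// planning only a subset of files/splits instead of filtering rows after the read.
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownTableSample}

class MyScanBuilder extends ScanBuilder with SupportsPushDownTableSample {
  private var sample: Option[(Double, Double, Long)] = None

  override def pushTableSample(lowerBound: Double, upperBound: Double,
                               withReplacement: Boolean, seed: Long): Boolean = {
    if (withReplacement) {
      false                                    // decline what this source cannot honor
    } else {
      sample = Some((lowerBound, upperBound, seed))
      true                                     // tells Spark the source applies the sample itself
    }
  }

  override def build(): Scan = ???             // plan splits here, honoring `sample` if set
}
{code}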

h2. Limitations
 * Block/split-level sampling is not fair/unbiased in the statistical sense: rows in the same block are kept or skipped together, so if the data is clustered (e.g., by date), a small sample may over-represent a few clusters. This is by design, because this sampling method does not read all the data; it prioritizes IO savings over a strict, unbiased statistical sampling distribution.


> [Improvement] Block/Split Sampling SQL Augmentation
> ---------------------------------------------------
>
>                 Key: SPARK-55978
>                 URL: https://issues.apache.org/jira/browse/SPARK-55978
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Stanley Yao
>            Priority: Major
>             Fix For: 4.1.1
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
