[ 
https://issues.apache.org/jira/browse/SPARK-32564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32564:
----------------------------------
    Fix Version/s: 3.0.1

>  Inject data statistics to simulate plan generation on actual TPCDS data
> ------------------------------------------------------------------------
>
>                 Key: SPARK-32564
>                 URL: https://issues.apache.org/jira/browse/SPARK-32564
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Tests
>    Affects Versions: 3.1.0
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>             Fix For: 3.0.1, 3.1.0
>
>
> `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then 
> checks if plans can be generated correctly. But, the generated plans can be 
> different from actual ones because the input tables are empty (e.g., the 
> plans always use broadcast-hash joins, but actual ones use sort-merge joins 
> for larger tables). To mitigate the issue, this ticket targets at defining 
> data statistics constants extracted from generated TPCDS data in 
> `TPCDSTableStats`, then injects the statistics via 
> `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in 
> `TPCDSQuerySuite`.
> Please see a link below about how to extract the table statistics:
>  - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
> For example, the generated plans of TPCDS `q2` are different with/without 
> this fix:
> {code:java}
> ==== w/ this fix: q2 ====
> == Physical Plan ==
> * Sort (43)
> +- Exchange (42)
>  +- * Project (41)
>  +- * SortMergeJoin Inner (40)
>  :- * Sort (28)
>  : +- Exchange (27)
>  : +- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- * Sort (39)
>  +- Exchange (38)
>  +- * Project (37)
>  +- * BroadcastHashJoin Inner BuildRight (36)
>  :- * HashAggregate (30)
>  : +- ReusedExchange (29)
>  +- BroadcastExchange (35)
>  +- * Project (34)
>  +- * Filter (33)
>  +- * ColumnarToRow (32)
>  +- Scan parquet default.date_dim (31)
> ==== w/o this fix: q2 ====
> == Physical Plan ==
> * Sort (40)
> +- Exchange (39)
>  +- * Project (38)
>  +- * BroadcastHashJoin Inner BuildRight (37)
>  :- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- BroadcastExchange (36)
>  +- * Project (35)
>  +- * BroadcastHashJoin Inner BuildRight (34)
>  :- * HashAggregate (28)
>  : +- ReusedExchange (27)
>  +- BroadcastExchange (33)
>  +- * Project (32)
>  +- * Filter (31)
>  +- * ColumnarToRow (30)
>  +- Scan parquet default.date_dim (29)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to