[ https://issues.apache.org/jira/browse/SPARK-32564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-32564: ---------------------------------- Fix Version/s: 3.0.1 > Inject data statistics to simulate plan generation on actual TPCDS data > ------------------------------------------------------------------------ > > Key: SPARK-32564 > URL: https://issues.apache.org/jira/browse/SPARK-32564 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests > Affects Versions: 3.1.0 > Reporter: Takeshi Yamamuro > Assignee: Takeshi Yamamuro > Priority: Major > Fix For: 3.0.1, 3.1.0 > > > `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then > checks if plans can be generated correctly. But, the generated plans can be > different from actual ones because the input tables are empty (e.g., the > plans always use broadcast-hash joins, but actual ones use sort-merge joins > for larger tables). To mitigate the issue, this ticket targets at defining > data statistics constants extracted from generated TPCDS data in > `TPCDSTableStats`, then injects the statistics via > `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in > `TPCDSQuerySuite`. > Please see a link below about how to extract the table statistics: > - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3 > For example, the generated plans of TPCDS `q2` are different with/without > this fix: > {code:java} > ==== w/ this fix: q2 ==== > == Physical Plan == > * Sort (43) > +- Exchange (42) > +- * Project (41) > +- * SortMergeJoin Inner (40) > :- * Sort (28) > : +- Exchange (27) > : +- * Project (26) > : +- * BroadcastHashJoin Inner BuildRight (25) > : :- * HashAggregate (19) > : : +- Exchange (18) > : : +- * HashAggregate (17) > : : +- * Project (16) > : : +- * BroadcastHashJoin Inner BuildRight (15) > : : :- Union (9) > : : : :- * Project (4) > : : : : +- * Filter (3) > : : : : +- * ColumnarToRow (2) > : : : : +- Scan parquet default.web_sales (1) > : : : +- * Project (8) > : : : +- * Filter (7) > : : : +- * ColumnarToRow (6) > : : : +- Scan parquet default.catalog_sales (5) > : : +- BroadcastExchange (14) > : : +- * Project (13) > : : +- * Filter (12) > : : +- * ColumnarToRow (11) > : : +- Scan parquet default.date_dim (10) > : +- BroadcastExchange (24) > : +- * Project (23) > : +- * Filter (22) > : +- * ColumnarToRow (21) > : +- Scan parquet default.date_dim (20) > +- * Sort (39) > +- Exchange (38) > +- * Project (37) > +- * BroadcastHashJoin Inner BuildRight (36) > :- * HashAggregate (30) > : +- ReusedExchange (29) > +- BroadcastExchange (35) > +- * Project (34) > +- * Filter (33) > +- * ColumnarToRow (32) > +- Scan parquet default.date_dim (31) > ==== w/o this fix: q2 ==== > == Physical Plan == > * Sort (40) > +- Exchange (39) > +- * Project (38) > +- * BroadcastHashJoin Inner BuildRight (37) > :- * Project (26) > : +- * BroadcastHashJoin Inner BuildRight (25) > : :- * HashAggregate (19) > : : +- Exchange (18) > : : +- * HashAggregate (17) > : : +- * Project (16) > : : +- * BroadcastHashJoin Inner BuildRight (15) > : : :- Union (9) > : : : :- * Project (4) > : : : : +- * Filter (3) > : : : : +- * ColumnarToRow (2) > : : : : +- Scan parquet default.web_sales (1) > : : : +- * Project (8) > : : : +- * Filter (7) > : : : +- * ColumnarToRow (6) > : : : +- Scan parquet default.catalog_sales (5) > : : +- BroadcastExchange (14) > : : +- * Project (13) > : : +- * Filter (12) > : : +- * ColumnarToRow (11) > : : +- Scan parquet default.date_dim (10) > : +- BroadcastExchange (24) > : +- * Project (23) > : +- * Filter (22) > : +- * ColumnarToRow (21) > : +- Scan parquet default.date_dim (20) > +- BroadcastExchange (36) > +- * Project (35) > +- * BroadcastHashJoin Inner BuildRight (34) > :- * HashAggregate (28) > : +- ReusedExchange (27) > +- BroadcastExchange (33) > +- * Project (32) > +- * Filter (31) > +- * ColumnarToRow (30) > +- Scan parquet default.date_dim (29) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org