rbalamohan opened a new issue, #6044: URL: https://github.com/apache/iceberg/issues/6044
### Apache Iceberg version

0.14.0

### Query engine

Spark

### Please describe the bug 🐞

Column projection/pruning is not happening on Iceberg tables for inner queries.

E.g. Q94: https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q94.sql

In this query, "select * from web_sales" should project only the relevant columns for further processing. This is not happening on Iceberg tables, causing a huge amount of data to be scanned and shuffled. This causes a regression in Q94.

Note that the following projection, which appears in the regular table's plan, is missing in the Iceberg table's plan:

_Project [ws_ship_date_sk#262, ws_ship_addr_sk#271, ws_web_site_sk#273, ws_warehouse_sk#275, ws_order_number#277L, ws_ext_ship_cost#288, ws_net_profit#293]_

Iceberg table:
============

`
      :- SortMergeJoin [ws_order_number#141075L], [wr_order_number#141211L], LeftAnti
:  :  :  :- Project [ws_ship_date_sk#141060, ws_ship_addr_sk#141069, ws_web_site_sk#141071, ws_order_number#141075L, ws_ext_ship_cost#141086, ws_net_profit#141091]
:  :  :  :  +- SortMergeJoin [ws_order_number#141075L], [ws_order_number#141176L], LeftSemi, NOT (ws_warehouse_sk#141073 = ws_warehouse_sk#141174)
:  :  :  :     :- Sort [ws_order_number#141075L ASC NULLS FIRST], false, 0
:  :  :  :     :  +- Exchange hashpartitioning(ws_order_number#141075L, 240), ENSURE_REQUIREMENTS, [id=#627348]
:  :  :  :     :     +- Filter ((isnotnull(ws_ship_date_sk#141060) AND isnotnull(ws_ship_addr_sk#141069)) AND isnotnull(ws_web_site_sk#141071))
:  :  :  :     :        +- BatchScan Iceberg spark_catalog.tpcds_sf1000_withdecimal_withdate_withnulls_iceberg_overwrite_inv_part.web_sales[ws_ship_date_sk#141060, ws_ship_addr_sk#141069, ws_web_site_sk#141071, ws_warehouse_sk#141073, ws_order_number#141075L, ws_ext_ship_cost#141
:  :  :  :     +- Sort [ws_order_number#141176L ASC NULLS FIRST], false, 0
:  :  :  :        +- Exchange hashpartitioning(ws_order_number#141176L, 240), ENSURE_REQUIREMENTS, [id=#627349]
:  :  :  :           +- Project [ws_warehouse_sk#141174, ws_order_number#141176L]
:  :  :  :              +- BatchScan Iceberg spark_catalog.tpcds_sf1000_withdecimal_withdate_withnulls_iceberg_overwrite_inv_part.web_sales[ws_sold_time_sk#141160, ws_ship_date_sk#141161, ws_item_sk#141162, ws_bill_customer_sk#141163, ws_bill_cdemo_sk#141164, ws_bill_hdemo_sk#141
:  :  :  +- Sort [wr_order_number#141211L ASC NULLS FIRST], false, 0
`

Regular table:
===========

`
      :- SortMergeJoin [ws_order_number#277L], [wr_order_number#16620L], LeftAnti
:  :  :  :- Project [ws_ship_date_sk#262, ws_ship_addr_sk#271, ws_web_site_sk#273, ws_order_number#277L, ws_ext_ship_cost#288, ws_net_profit#293]
:  :  :  :  +- SortMergeJoin [ws_order_number#277L], [ws_order_number#66133L], LeftSemi, NOT (ws_warehouse_sk#275 = ws_warehouse_sk#66131)
:  :  :  :     :- Sort [ws_order_number#277L ASC NULLS FIRST], false, 0
:  :  :  :     :  +- Exchange hashpartitioning(ws_order_number#277L, 240), ENSURE_REQUIREMENTS, [id=#593691]
:  :  :  :     :     +- Project [ws_ship_date_sk#262, ws_ship_addr_sk#271, ws_web_site_sk#273, ws_warehouse_sk#275, ws_order_number#277L, ws_ext_ship_cost#288, ws_net_profit#293]
:  :  :  :     :        +- Filter ((isnotnull(ws_ship_date_sk#262) AND isnotnull(ws_ship_addr_sk#271)) AND isnotnull(ws_web_site_sk#273))
:  :  :  :     :           +- FileScan parquet tpcds_sf1000_withdecimal_withdate_withnulls.web_sales[ws_ship_date_sk#262,ws_ship_addr_sk#271,ws_web_site_sk#273,ws_warehouse_sk#275,ws_order_number#277L,ws_ext_ship_cost#288,ws_net_profit#293,ws_sold_date_sk#294] Batched: true, DataFilters: [isnotnull(ws_ship_date_sk#262), isnotnull(ws_ship_addr_sk#271), isnotnull(ws_web_site_sk#273)], Format: Parquet, Location: CatalogFileIndex(1 paths)[s3a://nfqe-tpcds-test/spark-tpcds/sf1000-parquet/useDecimal=true,useDat..., PartitionFilters: [], PushedFilters: [IsNotNull(ws_ship_date_sk), IsNotNull(ws_ship_addr_sk), IsNotNull(ws_web_site_sk)], ReadSchema: struct<ws_ship_date_sk:int,ws_ship_addr_sk:int,ws_web_site_sk:int,ws_warehouse_sk:int,ws_order_nu...
:  :  :  :     +- Sort [ws_order_number#66133L ASC NULLS FIRST], false, 0
:  :  :  :        +- Exchange hashpartitioning(ws_order_number#66133L, 240), ENSURE_REQUIREMENTS, [id=#593692]
:  :  :  :           +- Project [ws_warehouse_sk#66131, ws_order_number#66133L]
:  :  :  :              +- FileScan parquet tpcds_sf1000_withdecimal_withdate_withnulls.web_sales[ws_warehouse_sk#66131,ws_order_number#66133L,ws_sold_date_sk#66150] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex(1 paths)[s3a://nfqe-tpcds-test/spark-tpcds/sf1000-parquet/useDecimal=true,useDat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ws_warehouse_sk:int,ws_order_number:bigint>
:  :  :  +- Sort [wr_order_number#16620L ASC NULLS FIRST], false, 0
`
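For anyone comparing the two plans, a small script can make the difference in scan width easier to see. The sketch below is only an illustration: `scan_columns` is a hypothetical helper (not part of Spark or Iceberg) that assumes scan nodes are printed as `BatchScan ...[cols` or `FileScan ...[cols]` on a single line, as in the plans in this report. The Iceberg scan line is truncated in the report, so the count for it is a lower bound.

```python
import re

# Hypothetical helper, not part of Spark or Iceberg: pulls the bracketed
# output-column list out of each scan node in an explain-plan string.
SCAN_RE = re.compile(r"(BatchScan|FileScan)\b[^\[\n]*\[([^\]\n]*)")
ATTR_ID_RE = re.compile(r"#\d+L?$")  # trailing attribute id such as #66133L


def scan_columns(plan_text):
    """Return a list of (scan_kind, [column names]) for each scan node found."""
    scans = []
    for kind, cols in SCAN_RE.findall(plan_text):
        names = [ATTR_ID_RE.sub("", c.strip()) for c in cols.split(",") if c.strip()]
        scans.append((kind, names))
    return scans


# Two scan lines copied from the plans in this report
# (the Iceberg line is cut off in the report, so its list is incomplete).
regular = (
    "+- FileScan parquet tpcds_sf1000_withdecimal_withdate_withnulls.web_sales"
    "[ws_warehouse_sk#66131,ws_order_number#66133L,ws_sold_date_sk#66150] Batched: true"
)
iceberg = (
    "+- BatchScan Iceberg spark_catalog.tpcds_sf1000_withdecimal_withdate_withnulls"
    "_iceberg_overwrite_inv_part.web_sales[ws_sold_time_sk#141160, ws_ship_date_sk#141161, "
    "ws_item_sk#141162, ws_bill_customer_sk#141163, ws_bill_cdemo_sk#141164, ws_bill_hdemo_sk#141"
)

for kind, names in scan_columns(regular) + scan_columns(iceberg):
    print(kind, len(names), names)
```

On the semi-join side, the Parquet scan lists only three columns, while the Iceberg scan already lists unrelated columns such as `ws_sold_time_sk` and `ws_item_sk` before the line is cut off.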
