rbalamohan opened a new issue, #6044:
URL: https://github.com/apache/iceberg/issues/6044

   ### Apache Iceberg version
   
   0.14.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Column projection/pruning is not happening for inner queries (subqueries) on Iceberg tables.
   
   Example: Q94
   https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q94.sql
   
   In this query, the inner "select * from web_sales" should project only the columns needed for further processing.
   
   This is not happening for Iceberg tables, so a huge amount of data is scanned and shuffled. This causes a regression in Q94.
   
   Note that the following projection, which sits above the Filter in the regular table's plan, is missing from the Iceberg plan:
   
   _Project [ws_ship_date_sk#262, ws_ship_addr_sk#271, ws_web_site_sk#273, ws_warehouse_sk#275, ws_order_number#277L, ws_ext_ship_cost#288, ws_net_profit#293]_
   
   Iceberg table:
   ============
   ```
    :- SortMergeJoin [ws_order_number#141075L], [wr_order_number#141211L], LeftAnti
                             :     :           :  :- Project [ws_ship_date_sk#141060, ws_ship_addr_sk#141069, ws_web_site_sk#141071, ws_order_number#141075L, ws_ext_ship_cost#141086, ws_net_profit#141091]
                             :     :           :  :  +- SortMergeJoin [ws_order_number#141075L], [ws_order_number#141176L], LeftSemi, NOT (ws_warehouse_sk#141073 = ws_warehouse_sk#141174)
                             :     :           :  :     :- Sort [ws_order_number#141075L ASC NULLS FIRST], false, 0
                             :     :           :  :     :  +- Exchange hashpartitioning(ws_order_number#141075L, 240), ENSURE_REQUIREMENTS, [id=#627348]
                             :     :           :  :     :     +- Filter ((isnotnull(ws_ship_date_sk#141060) AND isnotnull(ws_ship_addr_sk#141069)) AND isnotnull(ws_web_site_sk#141071))
                             :     :           :  :     :        +- BatchScan Iceberg spark_catalog.tpcds_sf1000_withdecimal_withdate_withnulls_iceberg_overwrite_inv_part.web_sales[ws_ship_date_sk#141060, ws_ship_addr_sk#141069, ws_web_site_sk#141071, ws_warehouse_sk#141073, ws_order_number#141075L, ws_ext_ship_cost#141
                             :     :           :  :     +- Sort [ws_order_number#141176L ASC NULLS FIRST], false, 0
                             :     :           :  :        +- Exchange hashpartitioning(ws_order_number#141176L, 240), ENSURE_REQUIREMENTS, [id=#627349]
                             :     :           :  :           +- Project [ws_warehouse_sk#141174, ws_order_number#141176L]
                             :     :           :  :              +- BatchScan Iceberg spark_catalog.tpcds_sf1000_withdecimal_withdate_withnulls_iceberg_overwrite_inv_part.web_sales[ws_sold_time_sk#141160, ws_ship_date_sk#141161, ws_item_sk#141162, ws_bill_customer_sk#141163, ws_bill_cdemo_sk#141164, ws_bill_hdemo_sk#141
                             :     :           :  +- Sort [wr_order_number#141211L ASC NULLS FIRST], false, 0
   ```
   
   
   
   
   Regular table:
   ===========
   ```
     :- SortMergeJoin [ws_order_number#277L], [wr_order_number#16620L], LeftAnti
                        :     :     :  :- Project [ws_ship_date_sk#262, ws_ship_addr_sk#271, ws_web_site_sk#273, ws_order_number#277L, ws_ext_ship_cost#288, ws_net_profit#293]
                        :     :     :  :  +- SortMergeJoin [ws_order_number#277L], [ws_order_number#66133L], LeftSemi, NOT (ws_warehouse_sk#275 = ws_warehouse_sk#66131)
                        :     :     :  :     :- Sort [ws_order_number#277L ASC NULLS FIRST], false, 0
                        :     :     :  :     :  +- Exchange hashpartitioning(ws_order_number#277L, 240), ENSURE_REQUIREMENTS, [id=#593691]
                        :     :     :  :     :     +- Project [ws_ship_date_sk#262, ws_ship_addr_sk#271, ws_web_site_sk#273, ws_warehouse_sk#275, ws_order_number#277L, ws_ext_ship_cost#288, ws_net_profit#293]
                        :     :     :  :     :        +- Filter ((isnotnull(ws_ship_date_sk#262) AND isnotnull(ws_ship_addr_sk#271)) AND isnotnull(ws_web_site_sk#273))
                        :     :     :  :     :           +- FileScan parquet tpcds_sf1000_withdecimal_withdate_withnulls.web_sales[ws_ship_date_sk#262,ws_ship_addr_sk#271,ws_web_site_sk#273,ws_warehouse_sk#275,ws_order_number#277L,ws_ext_ship_cost#288,ws_net_profit#293,ws_sold_date_sk#294] Batched: true, DataFilters: [isnotnull(ws_ship_date_sk#262), isnotnull(ws_ship_addr_sk#271), isnotnull(ws_web_site_sk#273)], Format: Parquet, Location: CatalogFileIndex(1 paths)[s3a://nfqe-tpcds-test/spark-tpcds/sf1000-parquet/useDecimal=true,useDat..., PartitionFilters: [], PushedFilters: [IsNotNull(ws_ship_date_sk), IsNotNull(ws_ship_addr_sk), IsNotNull(ws_web_site_sk)], ReadSchema: struct<ws_ship_date_sk:int,ws_ship_addr_sk:int,ws_web_site_sk:int,ws_warehouse_sk:int,ws_order_nu...
                        :     :     :  :     +- Sort [ws_order_number#66133L ASC NULLS FIRST], false, 0
                        :     :     :  :        +- Exchange hashpartitioning(ws_order_number#66133L, 240), ENSURE_REQUIREMENTS, [id=#593692]
                        :     :     :  :           +- Project [ws_warehouse_sk#66131, ws_order_number#66133L]
                        :     :     :  :              +- FileScan parquet tpcds_sf1000_withdecimal_withdate_withnulls.web_sales[ws_warehouse_sk#66131,ws_order_number#66133L,ws_sold_date_sk#66150] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex(1 paths)[s3a://nfqe-tpcds-test/spark-tpcds/sf1000-parquet/useDecimal=true,useDat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ws_warehouse_sk:int,ws_order_number:bigint>
                        :     :     :  +- Sort [wr_order_number#16620L ASC NULLS FIRST], false, 0
   ```
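
To make the difference between the two plans easier to see, here is a minimal sketch (not part of Spark or Iceberg) that pulls the output-column list out of each scan node in a plan string and compares it with what the parent Project asks for. The helper names `scan_columns` and `extra_scan_columns` are hypothetical. The sketch assumes each plan node sits on a single line, and it treats a column list cut off by Spark's plan truncation as a lower bound. The sample lines are the two inner `web_sales` scans from the plans above.

```python
import re

# Matches a scan node and captures the bracketed output-column list that follows
# the table name. The closing "]" is optional because Spark may truncate long
# plan lines, and the match never crosses a line break.
SCAN_RE = re.compile(r"(?:BatchScan|FileScan)[^\[\n]*\[([^\]\n]*)")
# Expression ids that Spark appends to attribute names, e.g. "#141160" or "#277L".
EXPR_ID_RE = re.compile(r"#\d+L?$")


def scan_columns(plan_text):
    """Return one list of column names per scan node found in plan_text."""
    scans = []
    for match in SCAN_RE.finditer(plan_text):
        names = [EXPR_ID_RE.sub("", col.strip()) for col in match.group(1).split(",")]
        scans.append([name for name in names if name])
    return scans


def extra_scan_columns(needed, scanned):
    """Return the columns a scan reads that the parent Project does not need."""
    needed_set = set(needed)
    return [col for col in scanned if col not in needed_set]


# Inner web_sales scans copied from the plans in this report.
iceberg_scan = (
    "+- BatchScan Iceberg spark_catalog."
    "tpcds_sf1000_withdecimal_withdate_withnulls_iceberg_overwrite_inv_part."
    "web_sales[ws_sold_time_sk#141160, ws_ship_date_sk#141161, ws_item_sk#141162, "
    "ws_bill_customer_sk#141163, ws_bill_cdemo_sk#141164, ws_bill_hdemo_sk#141"
)
parquet_scan = (
    "+- FileScan parquet tpcds_sf1000_withdecimal_withdate_withnulls."
    "web_sales[ws_warehouse_sk#66131,ws_order_number#66133L,ws_sold_date_sk#66150]"
)
# Columns requested by the Project directly above both scans.
needed = ["ws_warehouse_sk", "ws_order_number"]

print(extra_scan_columns(needed, scan_columns(parquet_scan)[0]))
# ['ws_sold_date_sk']
print(len(extra_scan_columns(needed, scan_columns(iceberg_scan)[0])))
# 6 (a lower bound, since the plan text is truncated)
```

For the Parquet table the scan reads only one column beyond what the Project needs, while the Iceberg scan lists columns such as `ws_sold_time_sk` and `ws_item_sk` that the inner query never uses.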
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]