ahshahid commented on issue #6039: URL: https://github.com/apache/iceberg/issues/6039#issuecomment-1305877706
> Some update: For the TPC-DS queries with limited data, enabling stats at the manifest level for non-partition columns still does not improve perf. The cost of the DPP query is pretty high, especially for queries 14a and 14b of TPC-DS. But there is one thing I am going to try:
>
> 1. For non-partition-column pruning, we do not need the exact values of the join keys in DPP. So I am going to modify the Spark DPP query for non-partitioning columns to fetch only the max & min.

I am hoping that the Spark-Iceberg code optimizes max/min queries by computing the answer using only the stats at the manifest-file level. If so, this should reduce the cost of the DPP query and still allow range pruning at the various levels in Iceberg. However, I have been given to understand that there is no mechanism in Spark DataSourceV2 to tell the DataSource to evaluate max/min using stats if available. So I will work part time on a prototypical change to get the max/min directly from Iceberg for DPP in these cases and see its impact on perf. A rough sketch of the idea is below.
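To make the idea concrete, here is a minimal sketch (not the actual prototype) of what min/max-based pruning could look like on the Spark side: instead of collecting the distinct join-key values from the build side, collect only the min and max and apply them as a range filter on the probe side, which Iceberg can evaluate against its file-level column bounds. Table and column names (`db.date_dim`, `db.store_sales`, `d_date_sk`, `ss_sold_date_sk`) are just illustrative TPC-DS-style placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{min, max}

object MinMaxDppSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("min-max-dpp-sketch")
      .getOrCreate()

    // Build (dimension) side of the join: in a real DPP query this is the
    // filtered side whose join-key values drive the pruning.
    val dim = spark.table("db.date_dim").filter("d_year = 2000")

    // Instead of collecting the exact set of join-key values, collect only
    // min and max. Two scalars are enough for range-based pruning.
    val Array(row) = dim.agg(min("d_date_sk"), max("d_date_sk")).collect()
    val (lo, hi) = (row.getLong(0), row.getLong(1))

    // Apply the range as an ordinary filter on the probe (fact) side.
    // Iceberg can evaluate such predicates against the lower/upper column
    // bounds it keeps in manifests and skip non-overlapping data files.
    val fact = spark.table("db.store_sales")
      .filter(s"ss_sold_date_sk BETWEEN $lo AND $hi")

    fact.explain()
    spark.stop()
  }
}
```

The open question is whether the min/max aggregation itself can be answered from manifest-level stats alone (rather than scanning data files), which is what would make the rewritten DPP subquery cheap.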
