ahshahid commented on issue #6039: URL: https://github.com/apache/iceberg/issues/6039#issuecomment-1305877706
> Some update: For the TPC-DS queries with limited data, enabling stats at the manifest level for non-partition columns still does not improve perf. The cost of the DPP query is pretty high, especially for queries 14a and 14b of TPC-DS. But there is one thing I am going to try:
>
> 1. For non-partition-column pruning, we do not need the exact values of the join keys in DPP. So I am going to modify the Spark DPP query for non-partitioning columns to fetch only the max & min.

I am hoping that the Spark-Iceberg code optimizes max/min queries by computing the answer using only the stats at the manifest-file level. If so, this should reduce the cost of the DPP query and still allow range pruning at the various levels in Iceberg. However, I have been given to understand that there is no mechanism in Spark DataSourceV2 to tell the DataSource to evaluate max/min using stats if available. So I will work part time on a prototypical change to get the max/min directly from Iceberg for DPP in these cases and see its impact on perf. A rough sketch of the idea is below.
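To make the idea concrete, here is a minimal sketch (not the actual prototype) of what min/max-based pruning could look like on the Spark side: instead of collecting the distinct join-key values from the build side, collect only the min and max and apply them as a range filter on the probe side, which Iceberg can evaluate against its file-level column bounds. Table and column names (`db.date_dim`, `db.store_sales`, `d_date_sk`, `ss_sold_date_sk`) are just illustrative TPC-DS-style placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{min, max}

object MinMaxDppSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("min-max-dpp-sketch")
      .getOrCreate()

    // Build (dimension) side of the join: in a real DPP query this is the
    // filtered side whose join-key values drive the pruning.
    val dim = spark.table("db.date_dim").filter("d_year = 2000")

    // Instead of collecting the exact set of join-key values, collect only
    // min and max. Two scalars are enough for range-based pruning.
    val Array(row) = dim.agg(min("d_date_sk"), max("d_date_sk")).collect()
    val (lo, hi) = (row.getLong(0), row.getLong(1))

    // Apply the range as an ordinary filter on the probe (fact) side.
    // Iceberg can evaluate such predicates against the lower/upper column
    // bounds it keeps in manifests and skip non-overlapping data files.
    val fact = spark.table("db.store_sales")
      .filter(s"ss_sold_date_sk BETWEEN $lo AND $hi")

    fact.explain()
    spark.stop()
  }
}
```

The open question is whether the min/max aggregation itself can be answered from manifest-level stats alone (rather than scanning data files), which is what would make the rewritten DPP subquery cheap.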
