deniskuzZ commented on PR #5215:
URL: https://github.com/apache/hive/pull/5215#issuecomment-2194719003

   > > @zhangbutao, do you know if Iceberg provides partition row_count stats? 
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk 
If not, maybe we can get it from the partitions metadata table:
   > > ```
   > > SELECT record_count FROM prod.db.table.partitions where spec_id in (....)
   > > ```
   > 
   > @deniskuzZ After some checking, I found that computing partition stats with 
the existing partitions metadata API, which is what `SELECT record_count FROM 
prod.db.table.partitions where spec_id in (....)` uses, is very expensive if the 
table has many data files, as the doc says:
   > 
   > 
   > [image: partitions_stats]
   > 
   > The best approach would be to try the new partition stats API added 
in Iceberg 1.5.0. We can do the partition `row count` optimization after 
upgrading Iceberg to 1.5.x. WDYT?
   
   I don't know the details and need to check the doc, but if Iceberg exposes 
record_count via the partitions metadata table, why would the select be 
expensive? Isn't it just a one-row fetch with a spec filter?
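
One possible explanation, sketched under assumptions: if the partitions metadata table is not a stored row per partition but is derived at query time by aggregating over the table's data file entries, then the cost follows the number of data files rather than the number of rows returned. The toy model below is not Iceberg code; `DataFileEntry` and `partition_record_counts` are hypothetical names used only to illustrate that shape of work.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for one data file entry in table metadata."""
    spec_id: int
    partition: tuple
    record_count: int


def partition_record_counts(entries, spec_ids):
    """Aggregate record_count per (spec_id, partition).

    Every entry is visited even when the spec filter keeps only a few,
    so the work grows with the number of data files.
    """
    totals = defaultdict(int)
    scanned = 0
    for entry in entries:
        scanned += 1
        if entry.spec_id in spec_ids:
            totals[(entry.spec_id, entry.partition)] += entry.record_count
    return dict(totals), scanned


# Assumed sample data: three files under spec 0, one under spec 1.
entries = [
    DataFileEntry(0, ("2024-06-01",), 100),
    DataFileEntry(0, ("2024-06-01",), 50),
    DataFileEntry(0, ("2024-06-02",), 70),
    DataFileEntry(1, ("2024-06",), 30),
]
counts, scanned = partition_record_counts(entries, {0})
print(counts, scanned)
```

In this model the spec filter narrows the result but not the scan, which is the kind of cost a precomputed partition stats file would avoid.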


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

