deniskuzZ commented on PR #5215: URL: https://github.com/apache/hive/pull/5215#issuecomment-2194719003
> > @zhangbutao, do you know if iceberg provides partition row_count stats? https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk
> > if not, maybe we can get it from meta table:
> > ```
> > SELECT record_count FROM prod.db.table.partitions where spec_id in (....)
> > ```
>
> @deniskuzZ After some checking, I found that computing partition stats through the existing partitions metadata API, which is what `SELECT record_count FROM prod.db.table.partitions where spec_id in (....)` uses, is very expensive if the table has many data files, as the doc says:
>
> ![partitions_stats](https://private-user-images.githubusercontent.com/9760681/341370914-c31a739f-9020-4634-8dd2-d9ea8bff44e0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk0OTU3MTQsIm5iZiI6MTcxOTQ5NTQxNCwicGF0aCI6Ii85NzYwNjgxLzM0MTM3MDkxNC1jMzFhNzM5Zi05MDIwLTQ2MzQtOGRkMi1kOWVhOGJmZjQ0ZTAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYyNyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MjdUMTMzNjU0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDdjNWVhZGMxZWNmMzk5NDQ2ZmUwYzQ4MDljNjFmYTVhZjE4NjZkNmJlMGNkZjJjYjcyNGM1ZTRkMDk2MDhlNCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.odIZqauAYGe2CP1yeraK_sdBDfZNgleUbByS2cwJWlU)
>
> The best way is to use the new partition stats API added in Iceberg 1.5.0. We can do the partition `row count` optimization after upgrading Iceberg to 1.5.x. WDYT?

I don't know the details and need to check the doc, but if Iceberg exposes record_count via the partitions metadata table, why would the select be expensive? Isn't it just a one-row fetch with a spec filter?
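For reference, a toy sketch of the cost concern raised in the quoted reply. It assumes (not confirmed in this thread) that per-partition `record_count` is produced by aggregating per-data-file entries instead of reading a stored row per partition. `DataFileEntry` and `partition_record_counts` are hypothetical names for illustration, not Iceberg APIs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for one data-file entry in table metadata."""
    spec_id: int
    partition: str
    record_count: int


def partition_record_counts(entries, spec_ids):
    """Sum record_count per (spec_id, partition).

    Every entry is visited once, so the cost grows with the number of
    data files even when only a few partition rows are returned.
    """
    totals = defaultdict(int)
    visited = 0
    for entry in entries:
        visited += 1
        if entry.spec_id in spec_ids:
            totals[(entry.spec_id, entry.partition)] += entry.record_count
    return dict(totals), visited


# Made-up entries: three files under spec 0, one under spec 1.
entries = [
    DataFileEntry(0, "dt=2024-06-01", 100),
    DataFileEntry(0, "dt=2024-06-01", 50),
    DataFileEntry(0, "dt=2024-06-02", 70),
    DataFileEntry(1, "dt=2024-06-02/hr=1", 30),
]
totals, visited = partition_record_counts(entries, spec_ids={0})
```

Under that assumption the spec filter narrows the result, but not the number of file entries that have to be read to build it.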
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org