amogh-jahagirdar commented on code in PR #390:
URL: https://github.com/apache/iceberg-python/pull/390#discussion_r1481728511


##########
pyiceberg/table/__init__.py
##########

```diff
@@ -134,6 +133,53 @@
 _JAVA_LONG_MAX = 9223372036854775807


+class TableProperties:
+    PARQUET_ROW_GROUP_SIZE_BYTES = "write.parquet.row-group-size-bytes"
+    PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT = 128 * 1024 * 1024  # 128 MB
+
+    PARQUET_ROW_GROUP_LIMIT = "write.parquet.row-group-limit"
+    PARQUET_ROW_GROUP_LIMIT_DEFAULT = 128 * 1024 * 1024  # 128 MB
+
+    PARQUET_PAGE_SIZE_BYTES = "write.parquet.page-size-bytes"
+    PARQUET_PAGE_SIZE_BYTES_DEFAULT = 1024 * 1024  # 1 MB
```

Review Comment:
   I think this should be fine for the initial PR, just so people have the properties, but we may want to benchmark these default values further. In general, Arrow and DuckDB will likely benefit from smaller row-group sizes because they are more aggressive about parallel reads. But of course we should measure that.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at: users@infra.apache.org
