amogh-jahagirdar commented on code in PR #390:
URL: https://github.com/apache/iceberg-python/pull/390#discussion_r1481728511


##########
pyiceberg/table/__init__.py
##########

```diff
@@ -134,6 +133,53 @@
 _JAVA_LONG_MAX = 9223372036854775807


+class TableProperties:
+    PARQUET_ROW_GROUP_SIZE_BYTES = "write.parquet.row-group-size-bytes"
+    PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT = 128 * 1024 * 1024  # 128 MB
+
+    PARQUET_ROW_GROUP_LIMIT = "write.parquet.row-group-limit"
+    PARQUET_ROW_GROUP_LIMIT_DEFAULT = 128 * 1024 * 1024  # 128 MB
+
+    PARQUET_PAGE_SIZE_BYTES = "write.parquet.page-size-bytes"
+    PARQUET_PAGE_SIZE_BYTES_DEFAULT = 1024 * 1024  # 1 MB
```

Review Comment:
   I think this should be fine for the initial PR, just so people have the properties, but we may want to benchmark these default values further. In general, Arrow and DuckDB will likely benefit from smaller row-group sizes because they are more aggressive about parallel reads. But of course we should measure that.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at: users@infra.apache.org
