kszucs commented on code in PR #47090:
URL: https://github.com/apache/arrow/pull/47090#discussion_r2234083891


##########
cpp/src/parquet/properties.h:
##########
@@ -155,6 +155,7 @@ class PARQUET_EXPORT ReaderProperties {
 ReaderProperties PARQUET_EXPORT default_reader_properties();
 
 static constexpr int64_t kDefaultDataPageSize = 1024 * 1024;
+static constexpr int64_t kDefaultMaxRowsPerPage = 20'000;

Review Comment:
   Maybe we should make this feature opt-in? Otherwise, should we choose a 
bigger value so that the data page size limit is triggered first?
   
   Given the Parquet physical type sizes, we could end up with much smaller 
data pages (even before encoding) than 1MB, which could be unexpected for 
users and would also increase the overall metadata size:
   ```
     - BOOLEAN: 1 bit boolean
     - INT32: 32 bit signed ints
     - INT64: 64 bit signed ints
     - INT96: 96 bit signed ints (deprecated; only used by legacy implementations)
     - FLOAT: IEEE 32-bit floating point values
     - DOUBLE: IEEE 64-bit floating point values
     - BYTE_ARRAY: arbitrarily long byte arrays
     - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
   ```
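
   To put rough numbers on this, here is a back-of-the-envelope sketch. The 
two constants come from the diff above; `PlainBytes` and `ValuesToFillPage` 
are hypothetical helpers written only for this comment, not existing Parquet 
APIs. It assumes a flat, required column (one value per row), fixed-width 
unencoded values, and no definition/repetition levels or page headers:

   ```cpp
   #include <cstdint>

   // Constants as they appear in the diff under review.
   constexpr int64_t kDefaultDataPageSize = 1024 * 1024;
   constexpr int64_t kDefaultMaxRowsPerPage = 20'000;

   // Hypothetical helper: unencoded size in bytes of `num_values` fixed-width
   // values of `bit_width` bits each, rounded up to whole bytes.
   constexpr int64_t PlainBytes(int64_t num_values, int64_t bit_width) {
     return (num_values * bit_width + 7) / 8;
   }

   // Hypothetical helper: number of fixed-width values needed before the
   // unencoded data reaches `page_size` bytes (ceiling division).
   constexpr int64_t ValuesToFillPage(int64_t page_size, int64_t bit_width) {
     return (page_size * 8 + bit_width - 1) / bit_width;
   }

   // Unencoded bytes per page when the row limit triggers first.
   static_assert(PlainBytes(kDefaultMaxRowsPerPage, 1) == 2'500, "BOOLEAN");
   static_assert(PlainBytes(kDefaultMaxRowsPerPage, 32) == 80'000, "INT32/FLOAT");
   static_assert(PlainBytes(kDefaultMaxRowsPerPage, 64) == 160'000, "INT64/DOUBLE");
   static_assert(PlainBytes(kDefaultMaxRowsPerPage, 96) == 240'000, "INT96");

   // Rows needed for the 1 MiB size limit to trigger before a row limit.
   static_assert(ValuesToFillPage(kDefaultDataPageSize, 32) == 262'144, "INT32/FLOAT");
   static_assert(ValuesToFillPage(kDefaultDataPageSize, 64) == 131'072, "INT64/DOUBLE");
   static_assert(ValuesToFillPage(kDefaultDataPageSize, 96) == 87'382, "INT96");
   ```

   Under these assumptions every fixed-width type hits the 20,000 row limit 
long before the 1MB size limit, so the row limit would effectively become the 
controlling setting for such columns. BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY 
depend on the value lengths, so they are left out here.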


