The constant DEFAULT_MAX_ROW_GROUP_LENGTH corresponds to parquet::WriterProperties::max_row_group_length, and its unit is number of rows. It is used by parquet::ParquetFileWriter. The parquet::StreamWriter class wraps an instance of a file writer and adds its own property, MaxRowGroupSize. The units for MaxRowGroupSize are indeed bytes.
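A minimal sketch of where each knob lives, assuming the Arrow C++ Parquet API as declared in parquet/properties.h and parquet/stream_writer.h (the file name "example.parquet" and the schema argument are illustrative only):

```cpp
// Sketch only: shows which property each writer class consults.
#include <arrow/io/file.h>
#include <parquet/exception.h>
#include <parquet/stream_writer.h>

void WriteExample(const std::shared_ptr<parquet::schema::GroupNode>& schema) {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(
      outfile, arrow::io::FileOutputStream::Open("example.parquet"));

  // max_row_group_length: unit is ROWS. Only consulted by
  // ParquetFileWriter::WriteTable, so it does not affect StreamWriter.
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .max_row_group_length(64 * 1024)
          ->build();

  parquet::StreamWriter os{
      parquet::ParquetFileWriter::Open(outfile, schema, props)};

  // MaxRowGroupSize: unit is BYTES (an estimate). This is the knob that
  // controls when StreamWriter closes one row group and starts the next.
  os.SetMaxRowGroupSize(512 * 1024 * 1024);
}
```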
The max_row_group_length property is only applied when calling ParquetFileWriter::WriteTable. The stream writer operates at a lower level and never calls this method, so the stream writer should never be affected by the max_row_group_length property.

One thing to keep in mind is that MaxRowGroupSize is an estimate only. With certain encodings it can be rather difficult to know ahead of time how many bytes you will end up writing, unless you separate the encoding step from the write step (which would require an extra memcpy, I think). In practice the estimators are conservative, so you will usually end up with something slightly smaller than 512MB. If it is significantly smaller, you may need to investigate how effective your encodings are and see if that is the cause.

On Fri, Aug 26, 2022 at 4:51 AM Arun Joseph <ajos...@gmail.com> wrote:
>
> Hi all,
>
> My understanding of the StreamWriter class is that it would persist Row
> Groups to disk once they exceed a certain size. In the documentation, it
> seems like this size is 512MB, but if I look at
> arrow/include/parquet/properties.h, the DEFAULT_MAX_ROW_GROUP_LENGTH seems to
> be 64MB. Is this reset to 512MB elsewhere? My parquet version is
>
> #define CREATED_BY_VERSION "parquet-cpp-arrow version 9.0.0-SNAPSHOT
>
> Thank You,
> Arun Joseph