The constant DEFAULT_MAX_ROW_GROUP_LENGTH is the default value for
parquet::WriterProperties::max_row_group_length, and its unit is number
of rows.  That property is used by parquet::ParquetFileWriter.  The
parquet::StreamWriter class wraps an instance of a file writer and adds
its own property, MaxRowGroupSize, whose unit is indeed bytes.
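
For concreteness, here is a minimal sketch showing where each knob
lives.  The API calls are real; the schema, target sizes, and file name
are made up for illustration:

    #include <arrow/io/file.h>
    #include <parquet/exception.h>
    #include <parquet/schema.h>
    #include <parquet/stream_writer.h>

    int main() {
      // A one-column schema, just to keep the sketch self-contained.
      parquet::schema::NodeVector fields;
      fields.push_back(parquet::schema::PrimitiveNode::Make(
          "value", parquet::Repetition::REQUIRED, parquet::Type::INT64));
      auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
          parquet::schema::GroupNode::Make(
              "schema", parquet::Repetition::REQUIRED, fields));

      // max_row_group_length lives on WriterProperties; its unit is rows.
      std::shared_ptr<parquet::WriterProperties> props =
          parquet::WriterProperties::Builder()
              .max_row_group_length(64 * 1024 * 1024)  // rows, not bytes
              ->build();

      std::shared_ptr<arrow::io::FileOutputStream> outfile;
      PARQUET_ASSIGN_OR_THROW(
          outfile, arrow::io::FileOutputStream::Open("example.parquet"));

      // MaxRowGroupSize lives on the StreamWriter; its unit is
      // (estimated) bytes.
      parquet::StreamWriter writer{
          parquet::ParquetFileWriter::Open(outfile, schema, props)};
      writer.SetMaxRowGroupSize(512 * 1024 * 1024);  // bytes

      writer << static_cast<int64_t>(42) << parquet::EndRow;
      return 0;
    }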

The max_row_group_length property is only applied when calling
ParquetFileWriter::WriteTable.  The stream writer operates at a lower
level and never calls that method, so it should never be affected by
the max_row_group_length property.
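
Since the stream writer never goes through WriteTable, one way to get
row groups of an exact row count out of it is to end them yourself with
EndRowGroup.  A sketch, assuming a writer like the one above (the
helper name is hypothetical):

    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <parquet/stream_writer.h>

    // Hypothetical helper: writes a single int64 column, closing a row
    // group every `rows_per_group` rows instead of relying on any size
    // property.
    void WriteWithExplicitGroups(parquet::StreamWriter& writer,
                                 const std::vector<int64_t>& values,
                                 std::size_t rows_per_group) {
      std::size_t rows_in_group = 0;
      for (int64_t v : values) {
        writer << v << parquet::EndRow;
        if (++rows_in_group == rows_per_group) {
          writer << parquet::EndRowGroup;  // flush, start a new row group
          rows_in_group = 0;
        }
      }
    }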

One thing to keep in mind is that MaxRowGroupSize is only an estimate.
With certain encodings it can be rather difficult to know ahead of time
how many bytes you will end up writing unless you separate the encoding
step from the write step (which would, I think, require an extra
memcpy).  In practice the estimators are conservative, so you will
usually end up with row groups slightly smaller than 512MB.  If they
are significantly smaller, you may want to investigate how effective
your encodings are and see whether that is the cause.
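
A quick way to check is to read the footer metadata back and compare
the actual row group sizes against your target.  A sketch using the
low-level reader (the file name matches the hypothetical example
above):

    #include <iostream>
    #include <parquet/file_reader.h>

    int main() {
      auto reader = parquet::ParquetFileReader::OpenFile("example.parquet");
      std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
      for (int i = 0; i < metadata->num_row_groups(); ++i) {
        // total_byte_size() reports uncompressed bytes; compare it (or
        // the per-column compressed sizes) against the 512MB target.
        std::unique_ptr<parquet::RowGroupMetaData> rg = metadata->RowGroup(i);
        std::cout << "row group " << i << ": " << rg->num_rows() << " rows, "
                  << rg->total_byte_size() << " bytes (uncompressed)\n";
      }
      return 0;
    }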

On Fri, Aug 26, 2022 at 4:51 AM Arun Joseph <ajos...@gmail.com> wrote:
>
> Hi all,
>
> My understanding of the StreamWriter class is that it would persist Row 
> Groups to disk once they exceed a certain size. In the documentation, it 
> seems like this size is 512MB, but if I look at 
> arrow/include/parquet/properties.h, the DEFAULT_MAX_ROW_GROUP_LENGTH seems to 
> be 64MB. Is this reset to 512MB elsewhere? My parquet version is
>
> #define CREATED_BY_VERSION "parquet-cpp-arrow version 9.0.0-SNAPSHOT"
>
> Thank You,
> Arun Joseph
