Re: [C++] How often does Parquet StreamWriter flush to disk?

2022-08-26 Thread Weston Pace
> Does that align with your understanding? If so, then wouldn't the MaxRowGroupSize affect memory usage when writing?

Not really. I would expect the writer to write a column to disk as soon as it has accumulated enough data to fill a data page (if not sooner). I'm not sure why it would need

Re: [C++] How often does Parquet StreamWriter flush to disk?

2022-08-26 Thread Arun Joseph
Hi Weston, From my understanding, if I'm writing out multiple GB of data to disk via StreamWriter, and if the MaxRowGroupSize defaults to 512MB, my hypothetical memory usage should be (assuming a single writer): ~= buffer[read data size] + ~MaxRowGroupSize + SUM(# of RowGroups * SizeOf(Row

Re: [C++] How often does Parquet StreamWriter flush to disk?

2022-08-26 Thread Weston Pace
If your goal is to save memory when writing then I wouldn't expect the MaxRowGroupSize to have much effect actually. However, I have not really studied the parquet writer in depth, so this is theoretical based on the format. Columns in a parquet file are written in row groups, which has a length

Re: [C++] How often does Parquet StreamWriter flush to disk?

2022-08-26 Thread Arun Joseph
Hi Weston, Thank you for the clarification! The default of 512MB and the slightly smaller writes align with what I've been seeing, and after using SetMaxRowGroupSize to change the MaxRowGroupSize, I am seeing the expected behavior with smaller values. In terms of the implications of setting a

Re: [C++] How often does Parquet StreamWriter flush to disk?

2022-08-26 Thread Weston Pace
The constant DEFAULT_MAX_ROW_GROUP_LENGTH is for parquet::WriterProperties::max_row_group_length and the unit here is # of rows. This is used by parquet::ParquetFileWriter. The parquet::StreamWriter class wraps an instance of a file writer and adds the property MaxRowGroupSize. The units for
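
To make the difference between the two limits concrete, here is a minimal, standard-library-only sketch. It is a toy model and not the Parquet API: ToyRowGroupWriter and SimulateWrite are hypothetical names, and the model assumes the StreamWriter limit is measured in estimated bytes, which the message above does not spell out. It closes a row group as soon as either a row-count limit or a byte-size limit is reached.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy model (NOT the Parquet API). It only counts rows and
// estimated bytes to show that the two limits use different units.
class ToyRowGroupWriter {
 public:
  // max_rows  models a row-count limit (unit: rows).
  // max_bytes models a byte-size limit (unit: bytes; an assumption here).
  ToyRowGroupWriter(std::int64_t max_rows, std::int64_t max_bytes)
      : max_rows_(max_rows), max_bytes_(max_bytes) {
    assert(max_rows > 0 && max_bytes > 0);
  }

  // Account for one row and close the group if either limit is reached.
  void AppendRow(std::int64_t row_bytes) {
    ++rows_in_group_;
    bytes_in_group_ += row_bytes;
    if (rows_in_group_ >= max_rows_ || bytes_in_group_ >= max_bytes_) {
      EndRowGroup();
    }
  }

  // Record the current group (if it holds any rows) and start a new one.
  void EndRowGroup() {
    if (rows_in_group_ == 0) return;
    closed_group_rows_.push_back(rows_in_group_);
    rows_in_group_ = 0;
    bytes_in_group_ = 0;
  }

  // Number of rows in each closed row group, in write order.
  const std::vector<std::int64_t>& closed_group_rows() const {
    return closed_group_rows_;
  }

 private:
  std::int64_t max_rows_;
  std::int64_t max_bytes_;
  std::int64_t rows_in_group_ = 0;
  std::int64_t bytes_in_group_ = 0;
  std::vector<std::int64_t> closed_group_rows_;
};

// Write total_rows rows of row_bytes each and return the rows per row group.
std::vector<std::int64_t> SimulateWrite(std::int64_t total_rows,
                                        std::int64_t row_bytes,
                                        std::int64_t max_rows,
                                        std::int64_t max_bytes) {
  ToyRowGroupWriter writer(max_rows, max_bytes);
  for (std::int64_t i = 0; i < total_rows; ++i) {
    writer.AppendRow(row_bytes);
  }
  writer.EndRowGroup();  // close the final, possibly partial, group
  return writer.closed_group_rows();
}
```

In this model, SimulateWrite(1000, 100, 1000000, 25000) produces four groups of 250 rows because the byte limit is hit first, while SimulateWrite(1000, 100, 300, 1000000000) produces groups of 300, 300, 300 and 100 rows because the row limit is hit first. A real writer estimates sizes differently, so treat the numbers as illustrative only.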

[C++] How often does Parquet StreamWriter flush to disk?

2022-08-26 Thread Arun Joseph
Hi all, My understanding of the StreamWriter class is that it would persist Row Groups to disk once they exceed a certain size. In the documentation, it seems like this size is 512MB, but if I look at