> Does that align with your understanding? If so, then wouldn't the
> MaxRowGroupSize affect memory usage when writing?
Not really. I would expect the writer to write a column to disk as
soon as it has accumulated enough data to fill a data page (if not
sooner). I'm not sure why it would need to buffer up an entire row
group before it starts writing to disk.

> Would an int32 rowgroup have fewer data pages than an int64 rowgroup?

Yes, each column could have a different number of data pages.

P.S. As I've recently been reminded, a write only pushes data from the
process' RSS space into the kernel's page cache. So if you are writing
a lot of data quickly you may see the system's available RAM decrease
because the page cache is filling up (even though the process' RSS
space remains small).

On Fri, Aug 26, 2022 at 7:48 AM Arun Joseph <[email protected]> wrote:
>
> Hi Weston,
>
> From my understanding, if I'm writing out multiple GBs of data to disk
> via StreamWriter, and if the MaxRowGroupSize is defaulted to 512MB, my
> hypothetical memory usage should be (assuming a single writer):
>
> ~= buffer[read data size] + ~MaxRowGroupSize + SUM(# of RowGroups *
> SizeOf(Row Group Metadata))
>
> Does that align with your understanding? If so, then wouldn't the
> MaxRowGroupSize affect memory usage when writing?
>
> Also, while not directly related to what we've been discussing, your
> explanation of the data pages did raise another question. Since columns
> can have different data types, and data pages have a fixed size (e.g.
> the default 1MB), how do mixed-width tables work w.r.t. written data
> pages? Would an int32 rowgroup have fewer data pages than an int64
> rowgroup?
>
> Thank You,
> Arun
>
> On Fri, Aug 26, 2022 at 10:35 AM Weston Pace <[email protected]> wrote:
>>
>> If your goal is to save memory when writing then I wouldn't expect
>> MaxRowGroupSize to have much effect, actually. However, I have not
>> really studied the parquet writer in depth, so this is theoretical,
>> based on the format.
>>
>> Columns in a parquet file are written in row groups, each of which has
>> a length (# of rows) that all the column chunks in the row group share
>> (i.e. if the row group has a length of 1Mi rows then each column chunk
>> will have 1Mi items). However, each column chunk is written as a
>> series of data pages. Data pages are indivisible, so a writer may
>> need to accumulate an entire page's worth of data before it can
>> persist it to disk (although, if using a streaming compression
>> algorithm, perhaps this is not required). Even if this is required, a
>> data page is usually quite small. I believe Arrow defaults a data page
>> to 1MiB. So, at most, I would expect a writer to have to accumulate
>> roughly data_pagesize * # of columns of RAM.
>>
>> However, I believe the writer also accumulates all row group metadata
>> in memory as well. I could be wrong on this (perhaps it was the
>> reader) and I don't recall if this is strictly needed (e.g. to
>> populate the footer) or if it is more of a convenience. This metadata
>> should generally be pretty small, but if you shrink the row group size
>> significantly then you might actually see more RAM usage by the
>> parquet writer.
>>
>> If your goal is to save memory when reading then the row group size
>> might matter, depending on how you read the parquet file. For
>> example, the most common way to read a parquet file in C++-arrow and
>> pyarrow is to read an entire row group all at once. There are no
>> utilities in pyarrow to read part of a row group or individual data
>> pages; I'm not sure if there are any in C++-arrow or not. There could
>> be, and I would very much like to see such readers exist someday. I
>> believe parquet-mr (the Java parquet reader) supports this. As a
>> result, if a reader doing streaming processing has to read in an
>> entire row group's worth of data, then the row group size will play a
>> large role in how much RAM that streaming reader requires.
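[A minimal sketch of the row-group-at-a-time reading pattern described
above, using the C++ parquet::arrow::FileReader. The file name
example.parquet is a placeholder and the calls assume an Arrow 9.x-era
C++ API; it is illustrative only, not code from the thread.]

#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

int main() {
  // Open the file and wrap it in an Arrow-aware parquet reader.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_ASSIGN_OR_THROW(infile,
                          arrow::io::ReadableFile::Open("example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // Read one row group at a time; each ReadRowGroup call materializes the
  // whole row group as a Table, so the row group size bounds peak RAM here.
  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadRowGroup(i, &table));
    // ... process `table`, then let it go out of scope before the next group
  }
  return 0;
}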
>>
>> On Fri, Aug 26, 2022 at 7:05 AM Arun Joseph <[email protected]> wrote:
>> >
>> > Hi Weston,
>> >
>> > Thank you for the clarification! The default 512MB and the slightly
>> > smaller writes align with what I've been seeing, and after using
>> > SetMaxRowGroupSize to change the MaxRowGroupSize, I am seeing the
>> > expected behavior with smaller values.
>> >
>> > In terms of the implications of setting a smaller value for
>> > MaxRowGroupSize, is it mainly the increased number of syscalls
>> > required to persist to disk, or is there anything else that would be
>> > a side effect?
>> >
>> > I am particularly interested in keeping my memory usage down, so I'm
>> > trying to get a better sense of the memory "landscape" that parquet
>> > utilizes. Once the row group is persisted to disk, the space that the
>> > row group previously utilized in memory should be freed for use once
>> > more, right?
>> >
>> > Thank You,
>> > Arun Joseph
>> >
>> > On Fri, Aug 26, 2022 at 9:36 AM Weston Pace <[email protected]> wrote:
>> >>
>> >> The constant DEFAULT_MAX_ROW_GROUP_LENGTH is for
>> >> parquet::WriterProperties::max_row_group_length, and the unit here
>> >> is # of rows. This is used by parquet::ParquetFileWriter. The
>> >> parquet::StreamWriter class wraps an instance of a file writer and
>> >> adds the property MaxRowGroupSize. The unit for MaxRowGroupSize is
>> >> indeed bytes.
>> >>
>> >> The max_row_group_length property is only applied when calling
>> >> ParquetFileWriter::WriteTable. The stream writer operates at a lower
>> >> level and never calls this method, so the stream writer should never
>> >> be affected by the max_row_group_length property.
>> >>
>> >> One thing to keep in mind is that MaxRowGroupSize is an estimate
>> >> only. With certain encodings it can be rather difficult to know
>> >> ahead of time how many bytes you will end up writing unless you
>> >> separate the encoding step from the write step (which would require
>> >> an extra memcpy, I think). In practice I think the estimators are
>> >> conservative, so you will usually end up with something slightly
>> >> smaller than 512MB. If it is significantly smaller, you may need to
>> >> investigate how effective your encodings are and see if that is the
>> >> cause.
>> >>
>> >> On Fri, Aug 26, 2022 at 4:51 AM Arun Joseph <[email protected]> wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > My understanding of the StreamWriter class is that it persists row
>> >> > groups to disk once they exceed a certain size. In the
>> >> > documentation, it seems like this size is 512MB, but if I look at
>> >> > arrow/include/parquet/properties.h, the DEFAULT_MAX_ROW_GROUP_LENGTH
>> >> > seems to be 64MB. Is this reset to 512MB elsewhere? My parquet
>> >> > version is
>> >> >
>> >> > #define CREATED_BY_VERSION "parquet-cpp-arrow version 9.0.0-SNAPSHOT
>> >> >
>> >> > Thank You,
>> >> > Arun Joseph
>> >
>> > --
>> > Arun Joseph
>
> --
> Arun Joseph
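[For reference, a minimal sketch putting the two knobs discussed in this
thread side by side: WriterProperties::max_row_group_length (rows, used by
ParquetFileWriter::WriteTable) and StreamWriter::SetMaxRowGroupSize
(estimated bytes). The single int64 column, the row count, and the file
name example.parquet are placeholder choices, and the calls assume an
Arrow 9.x-era C++ API; it is illustrative only, not code from the thread.]

#include <cstdint>
#include <memory>

#include <arrow/io/file.h>
#include <parquet/exception.h>
#include <parquet/stream_writer.h>

int main() {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(outfile,
                          arrow::io::FileOutputStream::Open("example.parquet"));

  // max_row_group_length is in rows and only applies to
  // ParquetFileWriter::WriteTable; data_pagesize sets the target data page
  // size (~1MiB is the default).
  parquet::WriterProperties::Builder builder;
  builder.max_row_group_length(1024 * 1024);
  builder.data_pagesize(1024 * 1024);

  // A single required int64 column, just to keep the sketch short.
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "value", parquet::Repetition::REQUIRED, parquet::Type::INT64,
      parquet::ConvertedType::INT_64));
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       fields));

  parquet::StreamWriter writer{
      parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

  // MaxRowGroupSize is an estimated byte target; the StreamWriter ends the
  // current row group once its estimated size reaches this limit.
  writer.SetMaxRowGroupSize(64 * 1024 * 1024);  // ~64MB row groups

  for (int64_t i = 0; i < 10000000; ++i) {
    writer << i << parquet::EndRow;
  }
  return 0;  // StreamWriter's destructor closes the file
}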
