I believe the formal Parquet standard already allows a file per column. At
least I remember it being discussed when the spec was first implemented. If
you look at the thrift spec, it actually allows for this:

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L771

That being said, I'm not sure which readers support this read pattern. If
it is part of the spec, doing it as a Parquet writing mode makes sense to
me.
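
For reference, the piece of the spec I mean is the optional file_path field on
ColumnChunk, which lets a column chunk live in a different file than the one
holding the footer. A minimal sketch of how you could inspect it with the
Arrow C++ reader is below; it assumes parquet::ColumnChunkMetaData exposes a
file_path() accessor in your version, so treat it as a sketch rather than a
guaranteed API:

    // Sketch: print the file_path of every column chunk referenced by a footer.
    // An empty file_path means the chunk lives in the footer's own file; a
    // non-empty one points at an external file (the "file per column" layout).
    #include <iostream>
    #include <memory>
    #include <string>
    #include "parquet/api/reader.h"

    int main(int argc, char** argv) {
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile(argv[1]);
      std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
      for (int rg = 0; rg < md->num_row_groups(); ++rg) {
        auto rg_md = md->RowGroup(rg);
        for (int col = 0; col < rg_md->num_columns(); ++col) {
          auto cc = rg_md->ColumnChunk(col);
          std::cout << "row group " << rg << ", column " << col << ": "
                    << (cc->file_path().empty() ? std::string("<same file>")
                                                : cc->file_path())
                    << std::endl;
        }
      }
      return 0;
    }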

On Mon, Jul 13, 2020 at 11:08 PM Roman Karlstetter <
roman.karlstet...@gmail.com> wrote:

> > I'd suggest a new write pattern. Write the columns page at a time to separate
> > files, then use a second process to concatenate the columns and append the
> > footer. Odds are you would do better than OS swapping and take memory
> > requirements down to page size times field count.
>
> This is exactly what a student of ours implemented pretty successfully:
> writing to one file per column (non-Parquet, binary, memory-mapped). And
> once enough data has been put into those "cache/buffer files", the data is
> flushed to a Parquet row group.
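>
> For illustration, the buffer-file part of that approach looks roughly like
> the sketch below (names are made up, POSIX-only, error handling omitted);
> the idea is that the OS keeps only the actively written pages resident and
> can write the rest back to disk:
>
>     // One append-only, memory-mapped scratch file per column; once enough
>     // data has accumulated, its contents are handed to the Parquet writer
>     // as one column of a row group.
>     #include <fcntl.h>
>     #include <sys/mman.h>
>     #include <unistd.h>
>     #include <cstdint>
>     #include <cstring>
>     #include <string>
>
>     class ColumnBufferFile {
>      public:
>       ColumnBufferFile(const std::string& path, size_t capacity)
>           : capacity_(capacity) {
>         fd_ = ::open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0600);
>         ::ftruncate(fd_, static_cast<off_t>(capacity_));
>         data_ = static_cast<uint8_t*>(::mmap(nullptr, capacity_,
>             PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0));
>       }
>       ~ColumnBufferFile() {
>         ::munmap(data_, capacity_);
>         ::close(fd_);
>       }
>       // Append raw encoded values; cold pages of the mapping can be evicted
>       // by the OS, so resident memory stays near the pages being written.
>       void Append(const void* bytes, size_t n) {
>         std::memcpy(data_ + size_, bytes, n);
>         size_ += n;
>       }
>       const uint8_t* data() const { return data_; }
>       size_t size() const { return size_; }
>
>      private:
>       int fd_ = -1;
>       size_t capacity_ = 0;
>       size_t size_ = 0;
>       uint8_t* data_ = nullptr;
>     };
>
> Flushing then just hands each buffer's contents to the corresponding column
> writer of the row group.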
>
> My question targeted the integration of these ideas into the Arrow Parquet
> writer: I wanted to know whether it makes sense to integrate them, or whether
> it's better to keep that functionality outside of arrow/parquet. Having it
> inside would have the benefit of reduced storage space thanks to
> encoding/compression, and thus smaller overhead in the final copy phase (less
> data to copy, and the data is already encoded/compressed). On the other hand,
> having one memory-mapped file per column is not something that seems to fit
> well with the current design of Arrow.
>
> Thanks for the feedback,
> Roman
>
> On Sun, Jul 12, 2020 at 3:05 AM Micah Kornfield <
> emkornfi...@gmail.com>:
>
>> This is an interesting idea. For S3 multipart uploads one might run into
>> limitations pretty quickly (only 10k parts appear to be supported, and all
>> but the last are expected to be at least 5 MB, if I read their docs
>> correctly [1]).
>>
>> [1] https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
>>
>>
>> On Saturday, July 11, 2020, Jacques Nadeau <jacq...@apache.org> wrote:
>>
>> > I'd suggest a new write pattern. Write the columns page at a time to
>> > separate files, then use a second process to concatenate the columns and
>> > append the footer. Odds are you would do better than OS swapping and take
>> > memory requirements down to page size times field count.
>> >
>> > In S3 I believe you could do this via a multipart upload and entirely skip
>> > the second step. I don't know of any implementations that actually do this
>> > yet.
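>> >
>> > As a rough sketch of the second step (purely illustrative; the
>> > serialized_footer blob is assumed to be built elsewhere, and rewriting the
>> > column chunk offsets in the thrift FileMetaData is the real work):
>> >
>> >     // Stitch per-column part files into a single Parquet file by plain
>> >     // byte concatenation, then append a precomputed footer blob
>> >     // (FileMetaData + 4-byte length + "PAR1"), assumed built elsewhere.
>> >     #include <fstream>
>> >     #include <string>
>> >     #include <vector>
>> >
>> >     void StitchParquetFile(const std::vector<std::string>& column_parts,
>> >                            const std::string& serialized_footer,
>> >                            const std::string& out_path) {
>> >       std::ofstream out(out_path, std::ios::binary);
>> >       out << "PAR1";  // leading magic bytes
>> >       for (const auto& part : column_parts) {
>> >         std::ifstream in(part, std::ios::binary);
>> >         out << in.rdbuf();  // append this column's pages verbatim
>> >       }
>> >       out << serialized_footer;  // footer, footer length, trailing "PAR1"
>> >     }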
>> >
>> > On Thu, Jul 9, 2020, 11:58 PM Roman Karlstetter <
>> > roman.karlstet...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I wasn't aware of the fact that jemalloc uses mmap automatically for larger
>> >> allocations, and I haven't tested this yet.
>> >>
>> >> The approach could be different in that we would know which parts of the
>> >> buffers are going to be used next (the buffers are append-only) and which
>> >> parts won't be needed until actually flushing the row group (and when
>> >> flushing, we also know the order). But I'm not sure whether that knowledge
>> >> helps a lot in a) saving memory compared to a generic allocator or b)
>> >> improving performance. In addition, communicating this knowledge to the
>> >> implementation will also be tricky in the general case, I guess.
>> >>
>> >> Regarding substituting a different memory pool: I was unsure whether that
>> >> memory pool would also be used for further allocations where the default
>> >> memory pool would be more appropriate. If not, then setting the memory pool
>> >> in the writer properties should actually work well.
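>> >>
>> >> A minimal sketch of what I mean (just the properties part; this assumes the
>> >> writer routes its internal buffer allocations through the pool passed here):
>> >>
>> >>     #include <memory>
>> >>     #include "arrow/memory_pool.h"
>> >>     #include "parquet/properties.h"
>> >>
>> >>     std::shared_ptr<parquet::WriterProperties> MakeProps(
>> >>         arrow::MemoryPool* custom_pool) {
>> >>       // Use a custom pool (e.g. one backed by memory-mapped files) instead
>> >>       // of arrow::default_memory_pool() for the writer's buffers.
>> >>       parquet::WriterProperties::Builder builder;
>> >>       builder.memory_pool(custom_pool);
>> >>       return builder.build();
>> >>     }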
>> >>
>> >> Maybe I should just play a bit with the different memory pool options and
>> >> see how they behave. It makes more sense to discuss further ideas once I
>> >> have some performance numbers.
>> >>
>> >> Thanks,
>> >> Roman
>> >>
>> >>
>> >> On Fri, Jul 10, 2020 at 6:47 AM Micah Kornfield <
>> >> emkornfi...@gmail.com>:
>> >>
>> >> > +parquet-dev as this seems more concerned with the non-Arrow pieces of
>> >> > Parquet.
>> >> >
>> >> > Hi Roman,
>> >> > Answers inline.
>> >> >
>> >> > > One way to solve that problem would be to use memory-mapped files instead
>> >> > > of plain memory buffers. That way, the amount of required memory can be
>> >> > > limited to the number of columns times the OS page size, which would be
>> >> > > independent of the row group size. Consequently, large row group sizes
>> >> > > pose no problem with respect to RAM consumption.
>> >> >
>> >> > I was under the impression that modern allocators (i.e. jemalloc) already
>> >> > use mmap for large allocations. How would this approach be different from
>> >> > the way allocators use it? Have you prototyped this approach to see if it
>> >> > allows for better scalability?
>> >> >
>> >> >
>> >> > > After a quick look at how the buffers are managed inside Arrow (allocated
>> >> > > from a default memory pool), I have the impression that an implementation
>> >> > > of this idea could be a rather huge change. I still wanted to know whether
>> >> > > that is something you could see being integrated or whether it is out of
>> >> > > scope for Arrow.
>> >> >
>> >> >
>> >> > A huge change probably isn't a great idea unless we've validated the
>> >> > approach along with alternatives. Is there currently code that doesn't make
>> >> > use of the MemoryPool [1] provided by WriterProperties? If so, we should
>> >> > probably fix it. Otherwise, is there a reason that you can't substitute a
>> >> > customized memory pool on WriterProperties?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > [1]
>> >> > https://github.com/apache/arrow/blob/5602c459eb8773b6be8059b1b118175e9f16b7a3/cpp/src/parquet/properties.h#L447
>> >> >
>> >> > On Thu, Jul 9, 2020 at 8:35 AM Roman Karlstetter <
>> >> > roman.karlstet...@gmail.com> wrote:
>> >> >
>> >> > > Hi everyone,
>> >> > >
>> >> > > For some time now, parquet::ParquetFileWriter has had the option to create
>> >> > > buffered row groups with AppendBufferedRowGroup(), which basically gives
>> >> > > you the possibility to write to columns in any order you like (in contrast
>> >> > > to the formerly only possible way of writing one column after the other).
>> >> > > This is cool since it saves the caller from having to create an in-memory
>> >> > > columnar representation of its data.
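>> >> > >
>> >> > > For context, the write pattern I mean looks roughly like this (a sketch
>> >> > > only; it assumes two required columns, INT64 and FLOAT, and leaves out
>> >> > > schema and sink setup):
>> >> > >
>> >> > >     #include <cstdint>
>> >> > >     #include "parquet/api/writer.h"
>> >> > >
>> >> > >     // With a buffered row group, column writers can be obtained and
>> >> > >     // written in any order; everything stays in memory until Close().
>> >> > >     void WriteOneRowGroup(parquet::ParquetFileWriter* writer,
>> >> > >                           const int64_t* col0, const float* col1,
>> >> > >                           int64_t num_rows) {
>> >> > >       parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
>> >> > >       // Column 1 first, then column 0: the order doesn't matter here.
>> >> > >       auto* w1 = static_cast<parquet::FloatWriter*>(rg->column(1));
>> >> > >       w1->WriteBatch(num_rows, nullptr, nullptr, col1);
>> >> > >       auto* w0 = static_cast<parquet::Int64Writer*>(rg->column(0));
>> >> > >       w0->WriteBatch(num_rows, nullptr, nullptr, col0);
>> >> > >       rg->Close();  // all buffered column data is flushed to the sink
>> >> > >     }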
>> >> > >
>> >> > > However, when the data size is huge compared to the available system
>> >> > > memory (due to a wide schema or a large row group size), this is
>> >> > > problematic, as the buffers allocated internally can take up a large
>> >> > > portion of the RAM of the machine the conversion is running on.
>> >> > >
>> >> > > One way to solve that problem would be to use memory-mapped files instead
>> >> > > of plain memory buffers. That way, the amount of required memory can be
>> >> > > limited to the number of columns times the OS page size, which would be
>> >> > > independent of the row group size. Consequently, large row group sizes
>> >> > > pose no problem with respect to RAM consumption.
>> >> > >
>> >> > > I wonder what you generally think about the idea of integrating an
>> >> > > AppendFileBufferedRowGroup() (or similarly named) method which gives the
>> >> > > user the option to have the internal buffers be memory-mapped files.
>> >> > >
>> >> > > After a quick look at how the buffers are managed inside Arrow (allocated
>> >> > > from a default memory pool), I have the impression that an implementation
>> >> > > of this idea could be a rather huge change. I still wanted to know whether
>> >> > > that is something you could see being integrated or whether it is out of
>> >> > > scope for Arrow.
>> >> > >
>> >> > > Thanks in advance and kind regards,
>> >> > > Roman
>> >> > >
>> >> >
>> >>
>> >
>>
>
