I did a quick search in parquet-mr and found at least one place where different files are explicitly forbidden [1]. I don't know whether this blocks all reading or only a specific case (and I'm not sure whether writing one file per column is allowed at all).
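For reference, what makes a file-per-column layout expressible at all is the optional file_path field on the ColumnChunk struct in parquet.thrift (paraphrased from memory; see the link in Jacques's message below for the authoritative version):

    struct ColumnChunk {
      /** File where column data is stored. If not set,
          assumed to be the same file as the metadata. */
      1: optional string file_path

      /** Byte offset in file_path to the ColumnMetaData */
      2: required i64 file_offset
      ...
    }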
Like I said, it makes sense, but it is potentially a big change code-wise (and it appears that, at least in some cases, it would require updates across both the C++ and parquet-mr implementations).

[1]
https://github.com/apache/parquet-mr/blob/2589cc821d2d470be1e79b86f511eb1f5fee4e5c/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1242

On Fri, Jul 17, 2020 at 9:48 AM Jacques Nadeau <jacq...@apache.org> wrote:

> I believe the formal Parquet standard already allows a file per column.
> At least I remember it being discussed when the spec was first
> implemented. If you look at the thrift spec, it actually allows for
> this:
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L771
>
> That being said, I'm not sure which readers support this read pattern.
> If it is part of the spec, doing it as a Parquet writing mode makes
> sense to me.
>
> On Mon, Jul 13, 2020 at 11:08 PM Roman Karlstetter <
> roman.karlstet...@gmail.com> wrote:
>
> > > I'd suggest a new write pattern. Write the columns page at a time
> > > to separate files, then use a second process to concatenate the
> > > columns and append the footer. Odds are you would do better than OS
> > > swapping and take memory requirements down to page size times field
> > > count.
> >
> > This is exactly what a student of ours implemented pretty
> > successfully: writing to one file per column (non-parquet, binary,
> > memory-mapped). And once enough data has been put into those
> > "cache/buffer files", the data is flushed to a parquet rowgroup.
> >
> > My question targeted the integration of these ideas into the arrow
> > parquet writer. I wanted to know whether it makes sense to integrate
> > them, or whether it's better to keep that functionality outside of
> > arrow/parquet. Having it inside would have the benefit of reduced
> > storage space because of encoding/compression, and thus smaller
> > overhead in the final copy phase (less data to copy, and the data
> > already encoded/compressed). But on the other hand, having one
> > memory-mapped file per column is not something that seems to fit well
> > with the current design of arrow.
> >
> > Thanks for the feedback,
> > Roman
> >
> > On Sun, Jul 12, 2020 at 3:05 AM Micah Kornfield <
> > emkornfi...@gmail.com> wrote:
> >
> >> This is an interesting idea. For S3 multipart uploads, one might run
> >> into limitations pretty quickly (only 10k parts appear to be
> >> supported, and all but the last are expected to be at least 5 MB, if
> >> I read their docs correctly [1]).
> >>
> >> [1] https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
> >>
> >> On Saturday, July 11, 2020, Jacques Nadeau <jacq...@apache.org> wrote:
> >>
> >> > I'd suggest a new write pattern. Write the columns page at a time
> >> > to separate files, then use a second process to concatenate the
> >> > columns and append the footer. Odds are you would do better than
> >> > OS swapping and take memory requirements down to page size times
> >> > field count.
> >> >
> >> > In S3, I believe you could do this via a multipart upload and
> >> > entirely skip the second step. I don't know of any implementations
> >> > that actually do this yet.
> >> >
> >> > On Thu, Jul 9, 2020, 11:58 PM Roman Karlstetter <
> >> > roman.karlstet...@gmail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I wasn't aware that jemalloc uses mmap automatically for larger
> >> >> allocations, and I haven't tested this yet.
> >> >>
> >> >> The approach could be different in that we would know which parts
> >> >> of the buffers are going to be used next (the buffers are
> >> >> append-only) and which parts won't be needed until actually
> >> >> flushing the rowgroup (and when flushing, we also know the order).
> >> >> But I'm not sure whether that knowledge helps a lot in a) saving
> >> >> memory compared to a generic allocator, or b) improving
> >> >> performance. In addition, communicating this knowledge to the
> >> >> implementation would also be tricky in the general case, I guess.
> >> >>
> >> >> Regarding setting the allocator to another memory pool: I was
> >> >> unsure whether the memory pool is also used for further
> >> >> allocations for which the default memory pool would be more
> >> >> appropriate. If not, then setting the memory pool in the writer
> >> >> properties should actually work well.
> >> >>
> >> >> Maybe I should just play a bit with the different memory pool
> >> >> options and see how they behave. It makes more sense to discuss
> >> >> further ideas once I have some performance numbers.
> >> >>
> >> >> Thanks,
> >> >> Roman
> >> >>
> >> >> On Fri, Jul 10, 2020 at 6:47 AM Micah Kornfield <
> >> >> emkornfi...@gmail.com> wrote:
> >> >>
> >> >> > +parquet-dev, as this seems more concerned with the non-arrow
> >> >> > pieces of parquet.
> >> >> >
> >> >> > Hi Roman,
> >> >> > Answers inline.
> >> >> >
> >> >> > > One way to solve that problem would be to use memory-mapped
> >> >> > > files instead of plain memory buffers. That way, the amount of
> >> >> > > required memory can be limited to the number of columns times
> >> >> > > the OS page size, which would be independent of the rowgroup
> >> >> > > size. Consequently, large rowgroup sizes would pose no problem
> >> >> > > with respect to RAM consumption.
> >> >> >
> >> >> > I was under the impression that modern allocators (e.g.,
> >> >> > jemalloc) already mmap for large allocations. How would this
> >> >> > approach be different from the way allocators use it? Have you
> >> >> > prototyped this approach to see if it allows for better
> >> >> > scalability?
> >> >> >
> >> >> > > After a quick look at how the buffers are managed inside arrow
> >> >> > > (allocated from a default memory pool), I have the impression
> >> >> > > that an implementation of this idea could be a rather huge
> >> >> > > change. I still wanted to know whether that is something you
> >> >> > > could see being integrated, or whether it is out of scope for
> >> >> > > arrow.
> >> >> >
> >> >> > A huge change probably isn't a great idea unless we've validated
> >> >> > the approach along with alternatives. Is there currently code
> >> >> > that doesn't make use of the MemoryPool [1] provided by
> >> >> > WriterProperties? If so, we should probably fix it. Otherwise,
> >> >> > is there a reason that you can't substitute a customized memory
> >> >> > pool on WriterProperties?
> >> >> >
> >> >> > Thanks,
> >> >> > Micah
> >> >> >
> >> >> > [1]
> >> >> > https://github.com/apache/arrow/blob/5602c459eb8773b6be8059b1b118175e9f16b7a3/cpp/src/parquet/properties.h#L447
> >> >> >
> >> >> > On Thu, Jul 9, 2020 at 8:35 AM Roman Karlstetter <
> >> >> > roman.karlstet...@gmail.com> wrote:
> >> >> >
> >> >> > > Hi everyone,
> >> >> > >
> >> >> > > For some time now, parquet::ParquetFileWriter has had the
> >> >> > > option to create buffered rowgroups with
> >> >> > > AppendBufferedRowGroup(), which basically gives you the
> >> >> > > possibility to write to columns in any order you like (in
> >> >> > > contrast to the previous requirement of writing one column
> >> >> > > after the other). This is cool, since it saves the caller from
> >> >> > > having to create an in-memory columnar representation of its
> >> >> > > data.
> >> >> > >
> >> >> > > However, when the data size is huge compared to the available
> >> >> > > system memory (due to a wide schema or a large rowgroup size),
> >> >> > > this is problematic, as the buffers allocated internally can
> >> >> > > take up a large portion of the RAM of the machine the
> >> >> > > conversion is running on.
> >> >> > >
> >> >> > > One way to solve that problem would be to use memory-mapped
> >> >> > > files instead of plain memory buffers. That way, the amount of
> >> >> > > required memory can be limited to the number of columns times
> >> >> > > the OS page size, which would be independent of the rowgroup
> >> >> > > size. Consequently, large rowgroup sizes would pose no problem
> >> >> > > with respect to RAM consumption.
> >> >> > >
> >> >> > > I wonder what you generally think about the idea of adding an
> >> >> > > AppendFileBufferedRowGroup() (or similarly named) method which
> >> >> > > gives the user the option to have the internal buffers be
> >> >> > > memory-mapped files.
> >> >> > >
> >> >> > > After a quick look at how the buffers are managed inside arrow
> >> >> > > (allocated from a default memory pool), I have the impression
> >> >> > > that an implementation of this idea could be a rather huge
> >> >> > > change. I still wanted to know whether that is something you
> >> >> > > could see being integrated, or whether it is out of scope for
> >> >> > > arrow.
> >> >> > >
> >> >> > > Thanks in advance and kind regards,
> >> >> > > Roman
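P.S. For anyone who wants to experiment with the memory pool substitution discussed above: here is a minimal, untested sketch against the C++ API (method names taken from properties.h and file_writer.h; the LoggingMemoryPool is only a stand-in for the mmap-backed pool one would actually have to write, and the schema and file path are made up for illustration):

    #include <cstdint>
    #include <memory>

    #include <arrow/io/file.h>
    #include <arrow/memory_pool.h>
    #include <parquet/column_writer.h>
    #include <parquet/file_writer.h>
    #include <parquet/schema.h>

    int main() {
      // Stand-in for a custom (e.g., mmap-backed) pool; this one just
      // forwards to the default pool and logs allocations.
      arrow::LoggingMemoryPool pool(arrow::default_memory_pool());

      // Route the writer's internal allocations through our pool.
      std::shared_ptr<parquet::WriterProperties> props =
          parquet::WriterProperties::Builder().memory_pool(&pool)->build();

      // A single required int64 column, just to have something to write.
      auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
          parquet::schema::GroupNode::Make(
              "schema", parquet::Repetition::REQUIRED,
              {parquet::schema::PrimitiveNode::Make(
                  "x", parquet::Repetition::REQUIRED, parquet::Type::INT64)}));

      auto sink =
          arrow::io::FileOutputStream::Open("/tmp/test.parquet").ValueOrDie();
      auto writer = parquet::ParquetFileWriter::Open(sink, schema, props);

      // Buffered row group: columns can be written in any order, and the
      // column buffers are held in `pool` until Close() flushes them.
      parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
      auto* col = static_cast<parquet::Int64Writer*>(rg->column(0));
      int64_t value = 42;
      col->WriteBatch(/*num_values=*/1, /*def_levels=*/nullptr,
                      /*rep_levels=*/nullptr, &value);
      rg->Close();
      writer->Close();
      return 0;
    }

If the buffered row group's allocations all show up in that pool (via bytes_allocated()/max_memory()), that would also be a quick way to measure whether a file-backed pool actually reduces resident memory.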