I did a quick search in parquet-mr and found at least one place where different files are explicitly forbidden [1]. I don't know whether this blocks all reading or only a specific case (and I'm not sure whether writing one file per column is allowed at all).
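For reference, what makes a file-per-column layout expressible at all is the optional file_path field on the ColumnChunk struct in parquet.thrift (paraphrased from memory; see the link in Jacques's message below for the authoritative version):

    struct ColumnChunk {
      /** File where column data is stored. If not set,
          assumed to be the same file as the metadata. */
      1: optional string file_path

      /** Byte offset in file_path to the ColumnMetaData */
      2: required i64 file_offset
      ...
    }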
Like I said, it makes sense, but it is potentially a big change code-wise (and it appears that, at least in some cases, it would require updates across both the C++ and parquet-mr implementations).

[1]
https://github.com/apache/parquet-mr/blob/2589cc821d2d470be1e79b86f511eb1f5fee4e5c/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1242

On Fri, Jul 17, 2020 at 9:48 AM Jacques Nadeau <jacq...@apache.org> wrote:

> I believe the formal Parquet standard already allows a file per column.
> At least I remember it being discussed when the spec was first
> implemented. If you look at the thrift spec, it actually allows for
> this:
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L771
>
> That being said, I'm not sure which readers support this read pattern.
> If it is part of the spec, doing it as a Parquet writing mode makes
> sense to me.
>
> On Mon, Jul 13, 2020 at 11:08 PM Roman Karlstetter <
> roman.karlstet...@gmail.com> wrote:
>
> > > I'd suggest a new write pattern. Write the columns page at a time
> > > to separate files, then use a second process to concatenate the
> > > columns and append the footer. Odds are you would do better than OS
> > > swapping and take memory requirements down to page size times field
> > > count.
> >
> > This is exactly what a student of ours implemented pretty
> > successfully: writing to one file per column (non-parquet, binary,
> > memory-mapped). And once enough data has been put into those
> > "cache/buffer files", the data is flushed to a parquet rowgroup.
> >
> > My question targeted the integration of these ideas into the arrow
> > parquet writer. I wanted to know whether it makes sense to integrate
> > them, or whether it's better to keep that functionality outside of
> > arrow/parquet. Having it inside would have the benefit of reduced
> > storage space because of encoding/compression, and thus smaller
> > overhead in the final copy phase (less data to copy, and the data
> > already encoded/compressed). But on the other hand, having one
> > memory-mapped file per column is not something that seems to fit well
> > with the current design of arrow.
> >
> > Thanks for the feedback,
> > Roman
> >
> > On Sun, Jul 12, 2020 at 3:05 AM Micah Kornfield <
> > emkornfi...@gmail.com> wrote:
> >
> >> This is an interesting idea. For S3 multipart uploads, one might run
> >> into limitations pretty quickly (only 10k parts appear to be
> >> supported, and all but the last are expected to be at least 5 MB, if
> >> I read their docs correctly [1]).
> >>
> >> [1] https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
> >>
> >> On Saturday, July 11, 2020, Jacques Nadeau <jacq...@apache.org> wrote:
> >>
> >> > I'd suggest a new write pattern. Write the columns page at a time
> >> > to separate files, then use a second process to concatenate the
> >> > columns and append the footer. Odds are you would do better than
> >> > OS swapping and take memory requirements down to page size times
> >> > field count.
> >> >
> >> > In S3, I believe you could do this via a multipart upload and
> >> > entirely skip the second step. I don't know of any implementations
> >> > that actually do this yet.
> >> >
> >> > On Thu, Jul 9, 2020, 11:58 PM Roman Karlstetter <
> >> > roman.karlstet...@gmail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I wasn't aware that jemalloc uses mmap automatically for larger
> >> >> allocations, and I haven't tested this yet.
> >> >>
> >> >> The approach could be different in that we would know which parts
> >> >> of the buffers are going to be used next (the buffers are
> >> >> append-only) and which parts won't be needed until actually
> >> >> flushing the rowgroup (and when flushing, we also know the order).
> >> >> But I'm not sure whether that knowledge helps a lot in a) saving
> >> >> memory compared to a generic allocator, or b) improving
> >> >> performance. In addition, communicating this knowledge to the
> >> >> implementation would also be tricky in the general case, I guess.
> >> >>
> >> >> Regarding setting the allocator to another memory pool: I was
> >> >> unsure whether the memory pool is also used for further
> >> >> allocations for which the default memory pool would be more
> >> >> appropriate. If not, then setting the memory pool in the writer
> >> >> properties should actually work well.
> >> >>
> >> >> Maybe I should just play a bit with the different memory pool
> >> >> options and see how they behave. It makes more sense to discuss
> >> >> further ideas once I have some performance numbers.
> >> >>
> >> >> Thanks,
> >> >> Roman
> >> >>
> >> >> On Fri, Jul 10, 2020 at 6:47 AM Micah Kornfield <
> >> >> emkornfi...@gmail.com> wrote:
> >> >>
> >> >> > +parquet-dev, as this seems more concerned with the non-arrow
> >> >> > pieces of parquet.
> >> >> >
> >> >> > Hi Roman,
> >> >> > Answers inline.
> >> >> >
> >> >> > > One way to solve that problem would be to use memory-mapped
> >> >> > > files instead of plain memory buffers. That way, the amount of
> >> >> > > required memory can be limited to the number of columns times
> >> >> > > the OS page size, which would be independent of the rowgroup
> >> >> > > size. Consequently, large rowgroup sizes would pose no problem
> >> >> > > with respect to RAM consumption.
> >> >> >
> >> >> > I was under the impression that modern allocators (e.g.,
> >> >> > jemalloc) already mmap for large allocations. How would this
> >> >> > approach be different from the way allocators use it? Have you
> >> >> > prototyped this approach to see if it allows for better
> >> >> > scalability?
> >> >> >
> >> >> > > After a quick look at how the buffers are managed inside arrow
> >> >> > > (allocated from a default memory pool), I have the impression
> >> >> > > that an implementation of this idea could be a rather huge
> >> >> > > change. I still wanted to know whether that is something you
> >> >> > > could see being integrated, or whether it is out of scope for
> >> >> > > arrow.
> >> >> >
> >> >> > A huge change probably isn't a great idea unless we've validated
> >> >> > the approach along with alternatives. Is there currently code
> >> >> > that doesn't make use of the MemoryPool [1] provided by
> >> >> > WriterProperties? If so, we should probably fix it. Otherwise,
> >> >> > is there a reason that you can't substitute a customized memory
> >> >> > pool on WriterProperties?
> >> >> >
> >> >> > Thanks,
> >> >> > Micah
> >> >> >
> >> >> > [1]
> >> >> > https://github.com/apache/arrow/blob/5602c459eb8773b6be8059b1b118175e9f16b7a3/cpp/src/parquet/properties.h#L447
> >> >> >
> >> >> > On Thu, Jul 9, 2020 at 8:35 AM Roman Karlstetter <
> >> >> > roman.karlstet...@gmail.com> wrote:
> >> >> >
> >> >> > > Hi everyone,
> >> >> > >
> >> >> > > For some time now, parquet::ParquetFileWriter has had the
> >> >> > > option to create buffered rowgroups with
> >> >> > > AppendBufferedRowGroup(), which basically gives you the
> >> >> > > possibility to write to columns in any order you like (in
> >> >> > > contrast to the previous requirement of writing one column
> >> >> > > after the other). This is cool, since it saves the caller from
> >> >> > > having to create an in-memory columnar representation of its
> >> >> > > data.
> >> >> > >
> >> >> > > However, when the data size is huge compared to the available
> >> >> > > system memory (due to a wide schema or a large rowgroup size),
> >> >> > > this is problematic, as the buffers allocated internally can
> >> >> > > take up a large portion of the RAM of the machine the
> >> >> > > conversion is running on.
> >> >> > >
> >> >> > > One way to solve that problem would be to use memory-mapped
> >> >> > > files instead of plain memory buffers. That way, the amount of
> >> >> > > required memory can be limited to the number of columns times
> >> >> > > the OS page size, which would be independent of the rowgroup
> >> >> > > size. Consequently, large rowgroup sizes would pose no problem
> >> >> > > with respect to RAM consumption.
> >> >> > >
> >> >> > > I wonder what you generally think about the idea of adding an
> >> >> > > AppendFileBufferedRowGroup() (or similarly named) method which
> >> >> > > gives the user the option to have the internal buffers be
> >> >> > > memory-mapped files.
> >> >> > >
> >> >> > > After a quick look at how the buffers are managed inside arrow
> >> >> > > (allocated from a default memory pool), I have the impression
> >> >> > > that an implementation of this idea could be a rather huge
> >> >> > > change. I still wanted to know whether that is something you
> >> >> > > could see being integrated, or whether it is out of scope for
> >> >> > > arrow.
> >> >> > >
> >> >> > > Thanks in advance and kind regards,
> >> >> > > Roman
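P.S. For anyone who wants to experiment with the memory pool substitution discussed above: here is a minimal, untested sketch against the C++ API (method names taken from properties.h and file_writer.h; the LoggingMemoryPool is only a stand-in for the mmap-backed pool one would actually have to write, and the schema and file path are made up for illustration):

    #include <cstdint>
    #include <memory>

    #include <arrow/io/file.h>
    #include <arrow/memory_pool.h>
    #include <parquet/column_writer.h>
    #include <parquet/file_writer.h>
    #include <parquet/schema.h>

    int main() {
      // Stand-in for a custom (e.g., mmap-backed) pool; this one just
      // forwards to the default pool and logs allocations.
      arrow::LoggingMemoryPool pool(arrow::default_memory_pool());

      // Route the writer's internal allocations through our pool.
      std::shared_ptr<parquet::WriterProperties> props =
          parquet::WriterProperties::Builder().memory_pool(&pool)->build();

      // A single required int64 column, just to have something to write.
      auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
          parquet::schema::GroupNode::Make(
              "schema", parquet::Repetition::REQUIRED,
              {parquet::schema::PrimitiveNode::Make(
                  "x", parquet::Repetition::REQUIRED, parquet::Type::INT64)}));

      auto sink =
          arrow::io::FileOutputStream::Open("/tmp/test.parquet").ValueOrDie();
      auto writer = parquet::ParquetFileWriter::Open(sink, schema, props);

      // Buffered row group: columns can be written in any order, and the
      // column buffers are held in `pool` until Close() flushes them.
      parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
      auto* col = static_cast<parquet::Int64Writer*>(rg->column(0));
      int64_t value = 42;
      col->WriteBatch(/*num_values=*/1, /*def_levels=*/nullptr,
                      /*rep_levels=*/nullptr, &value);
      rg->Close();
      writer->Close();
      return 0;
    }

If the buffered row group's allocations all show up in that pool (via bytes_allocated()/max_memory()), that would also be a quick way to measure whether a file-backed pool actually reduces resident memory.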