Hi David,

This is one solution for consolidating parquet files, but as long as one is rewriting them, consolidating small row groups could also make sense. It is also worth noting:

1. I don't think this is what the original poster was looking for, since reading the files involves decompression and decoding.
2. There isn't a guarantee that encoding and compression will stay the same between different versions/implementations of parquet (e.g. I think there are different thresholds for dictionary encoding).
On Friday, October 15, 2021, Lee, David <david....@blackrock.com> wrote:

> Here was my solution back in 2018. It's easier to do now with pyarrow's
> Python APIs than with Spark.
>
> https://stackoverflow.com/questions/39187622/how-do-you-control-the-size-of-the-output-file/51216145#51216145
>
> Read all the smaller files in your list one at a time and write them to
> the temp file as a parquet ROW GROUP. It is very important to write each
> file in as a row group, which preserves compression encoding and
> guarantees the amount of bytes (minus schema metadata) written will be
> the same as the original file size.
>
> -----Original Message-----
> From: Lee, David
> Sent: Friday, October 15, 2021 2:04 PM
> To: dev@parquet.apache.org; 'emkornfi...@gmail.com' <emkornfi...@gmail.com>; david....@blackrock.com.invalid
> Subject: RE: Concatenation of parquet files
>
> Well, this is right and wrong. There is one footer, but the statistics
> are captured per row group, which allows row groups to be easily
> concatenated into a new file without rebuilding column stats.
>
> The final file looks more like:
>
> ROW GROUP A1
> ROW GROUP A2
> ROW GROUP B1
> ROW GROUP B2
> FOOTER A1, A2, B1, B2
>
> http://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/
>
> When all the row groups are written, and before closing the file, the
> Parquet writer adds the footer to the end of the file.
>
> The footer includes the file schema (column names and their types) as
> well as details about every row group (total size, number of rows,
> min/max statistics, number of NULL values for every column).
>
> Note that these column statistics are per row group, not for the entire
> file.
> -----Original Message-----
> From: Micah Kornfield <emkornfi...@gmail.com>
> Sent: Friday, October 15, 2021 1:40 PM
> To: david....@blackrock.com.invalid
> Cc: dev@parquet.apache.org
> Subject: Re: Concatenation of parquet files
>
> External Email: Use caution with links and attachments
>
> Hi David,
> I'm not sure I understand. Concatenating files like this would likely
> break things. In particular in the example:
>
> > Merged:
> >
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2
>
> There should only be one footer per file; otherwise, I don't think there
> is any means of discovering the A row groups. Also, without rewriting
> metadata, the file offsets of B would be wrong
> (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790).
>
> > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> > "We can similarly write a Parquet file with multiple row groups by
> > using ParquetWriter"
>
> Multiple row groups are fine. Combining them after the fact by simple
> file concatenation (which is what I understand the original question to
> be) would yield incorrect results. If you reread small files and write
> them out again in one pass, that would be fine.
>
> Cheers,
> Micah
>
> On Fri, Oct 15, 2021 at 1:29 PM Lee, David <david....@blackrock.com.invalid> wrote:
>
> > Each row group should have its own statistics footer or dictionary.
> > Your file structure should look like this:
> >
> > *contents of parquet file A:*
> >
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> >
> > *contents of parquet file B:*
> >
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2
> >
> > Merged:
> >
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2
> >
> > I frequently concatenate smaller parquet files by appending row groups
> > until I hit an optimal 125 meg file size for HDFS.
> >
> > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> > "We can similarly write a Parquet file with multiple row groups by
> > using ParquetWriter"
> >
> > -----Original Message-----
> > From: Pau Tallada <tall...@pic.es>
> > Sent: Tuesday, September 14, 2021 6:01 AM
> > To: dev@parquet.apache.org
> > Subject: Re: Concatenation of parquet files
> >
> > Dear Gabor,
> >
> > Thanks a lot for the clarification! ☺
> > I understand this is not a common use case; I somewhat just had hope it
> > could be done easily :P
> >
> > If you are interested, I attach a Colab notebook that shows this
> > behaviour. The same data written three times produces different binary
> > contents.
> >
> > https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
> >
> > Thanks again and best regards,
> >
> > Pau
> >
> > Missatge de Gabor Szadovszky <ga...@apache.org> del dia dt., 14 de set.
> > 2021 a les 10:54:
> >
> > > Hi Pau,
> > >
> > > I guess attachments are not allowed on the Apache lists, so we cannot
> > > see the image.
> > >
> > > If the two row groups contain the very same data in the same order,
> > > encoded with the same encoding and compressed with the same codec, I
> > > think they should be the same binary. I am not sure why you have
> > > different binary streams for these row groups, but if the proper data
> > > can be decoded from both row groups I would not spend too much time
> > > on it.
> > >
> > > About merging row groups: it is a tough issue, and far from as simple
> > > as concatenating the row groups (files) and creating a new footer.
> > > There are statistics in the footer that you have to take care of, as
> > > well as column indexes and bloom filters that are part of neither the
> > > footer nor the row groups. (They are written in separate data
> > > structures before the footer.)
> > > If you don't want to decode the row groups, these statistics can be
> > > updated (with the new offsets) and the new footer can be created by
> > > reading the original footers only. The problem here is that creating
> > > such a parquet file is not very useful in most cases. Most of the
> > > problems come from many small row groups (in small files), which
> > > cannot be solved this way. To solve the small files problem we need
> > > to merge the row groups, and for that we need to decode the original
> > > data so we can re-create the statistics (at least for bloom filters).
> > >
> > > Long story short: theoretically it is solvable, but it is a feature
> > > we haven't implemented properly so far.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <tall...@pic.es> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a developer of cosmohub.pic.es, a science platform that
> > > > provides interactive analysis and exploration of large scientific
> > > > datasets.
> > > > Working with Hive, users are able to generate the subset of data
> > > > they are interested in, and this result set is stored as a set of
> > > > files. When users want to download this dataset, we
> > > > combine/concatenate all the files on-the-fly to generate a single
> > > > stream that gets downloaded. Done right, this is very efficient,
> > > > avoids materializing the combined file, and the stream is even
> > > > seekable so downloads can be resumed. We are able to do this for
> > > > csv.bz2 and FITS formats.
> > > >
> > > > I am trying to do the same with parquet. Looking at the format
> > > > specification, it seems that it could be done by simply
> > > > concatenating the binary blobs of the set of row groups and
> > > > generating a new footer for the merged file. The problem is that
> > > > the same data, written twice in the same file (in two row groups),
> > > > is represented with some differences in the binary stream produced
> > > > (see attached image). Why is the binary representation of a row
> > > > group different if the data is the same? Is the order or position
> > > > of a row group codified inside its metadata?
> > > >
> > > > I attach the image of a parquet file with the same data (a single
> > > > integer column named 'c' with a single value 0) written twice, with
> > > > at least two differences marked in red and blue.
> > > > [image: image.png]
> > > >
> > > > A little diagram to show what I'm trying to accomplish:
> > > >
> > > > *contents of parquet file A:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > FOOTER A
> > > >
> > > > *contents of parquet file B:*
> > > > PAR1
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > FOOTER B
> > > >
> > > > If I'm not mistaken, there is no metadata in each row group that
> > > > refers to its file or its position, so they should be relocatable.
> > > > The final file/stream would look like this:
> > > >
> > > > *contents of combined parquet file:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > NEW FOOTER A+B
> > > >
> > > > Thanks a lot in advance for the help understanding this,
> > > >
> > > > Best regards,
> > > >
> > > > Pau.
> > > > --
> > > > ----------------------------------
> > > > Pau Tallada Crespí
> > > > Departament de Serveis
> > > > Port d'Informació Científica (PIC)
> > > > Tel: +34 93 170 2729
> > > > ----------------------------------