Hi David,

This is one solution for consolidating parquet files, but as long as one is rewriting them, consolidating small row groups could also make sense. It is also worth noting:

1. I don't think this is what the original poster was looking for, since reading the files involves decompression and decoding.
2. There isn't a guarantee that encoding and compression will stay the same between different versions/implementations of parquet (e.g. I think there are different thresholds for dictionary encoding).
On Friday, October 15, 2021, Lee, David <david....@blackrock.com> wrote:

> Here was my solution back in 2018. It's easier to do now with pyarrow's
> Python APIs than with Spark.
>
> https://stackoverflow.com/questions/39187622/how-do-you-control-the-size-of-the-output-file/51216145#51216145
>
> Read all the smaller files in your list one at a time and write them to
> the temp file as a parquet ROW GROUP. It is very important to write each
> file in as a row group, which preserves compression encoding and
> guarantees the amount of bytes (minus schema metadata) written will be
> the same as the original file size.
>
> -----Original Message-----
> From: Lee, David
> Sent: Friday, October 15, 2021 2:04 PM
> To: dev@parquet.apache.org; 'emkornfi...@gmail.com' <emkornfi...@gmail.com>; david....@blackrock.com.invalid
> Subject: RE: Concatenation of parquet files
>
> Well, this is right and wrong. There is one footer, but the statistics
> are captured per row group, which allows row groups to be easily
> concatenated into a new file without rebuilding column stats.
>
> The final file looks more like:
>
> ROW GROUP A1
> ROW GROUP A2
> ROW GROUP B1
> ROW GROUP B2
> FOOTER A1, A2, B1, B2
>
> http://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/
>
> When all the row groups are written, and before closing the file, the
> Parquet writer adds the footer to the end of the file.
>
> The footer includes the file schema (column names and their types) as
> well as details about every row group (total size, number of rows,
> min/max statistics, number of NULL values for every column).
>
> Note that these column statistics are per row group, not for the entire
> file.
> -----Original Message-----
> From: Micah Kornfield <emkornfi...@gmail.com>
> Sent: Friday, October 15, 2021 1:40 PM
> To: david....@blackrock.com.invalid
> Cc: dev@parquet.apache.org
> Subject: Re: Concatenation of parquet files
>
> External Email: Use caution with links and attachments
>
> Hi David,
> I'm not sure I understand. Concatenating files like this would likely
> break things. In particular in the example:
>
> > Merged:
> >
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2
>
> There should only be one footer per file; otherwise, I don't think there
> is any means of discovering the A row groups. Also, without rewriting
> metadata, the file offsets of B would be wrong
> (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790).
>
> > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> > "We can similarly write a Parquet file with multiple row groups by
> > using ParquetWriter"
>
> Multiple row groups are fine. Combining them after the fact by simple
> file concatenation (which is what I understand the original question to
> be) would yield incorrect results. If you reread small files and write
> them out again in one pass, that would be fine.
>
> Cheers,
> Micah
>
> On Fri, Oct 15, 2021 at 1:29 PM Lee, David <david....@blackrock.com.invalid> wrote:
>
> > Each row group should have its own statistics footer or dictionary.
> > Your file structure should look like this:
> >
> > *contents of parquet file A:*
> >
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> >
> > *contents of parquet file B:*
> >
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2
> >
> > Merged:
> >
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2
> >
> > I frequently concatenate smaller parquet files by appending row groups
> > until I hit an optimal 125 meg file size for HDFS.
> >
> > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> > "We can similarly write a Parquet file with multiple row groups by
> > using ParquetWriter"
> >
> > -----Original Message-----
> > From: Pau Tallada <tall...@pic.es>
> > Sent: Tuesday, September 14, 2021 6:01 AM
> > To: dev@parquet.apache.org
> > Subject: Re: Concatenation of parquet files
> >
> > Dear Gabor,
> >
> > Thanks a lot for the clarification! ☺
> > I understand this is not a common use case; I somewhat just had hope it
> > could be done easily :P
> >
> > If you are interested, I attach a Colab notebook that shows this
> > behaviour. The same data written three times produces different binary
> > contents.
> >
> > https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
> >
> > Thanks again and best regards,
> >
> > Pau
> >
> > Missatge de Gabor Szadovszky <ga...@apache.org> del dia dt., 14 de set.
> > 2021 a les 10:54:
> >
> > > Hi Pau,
> > >
> > > I guess attachments are not allowed on the Apache lists, so we cannot
> > > see the image.
> > >
> > > If the two row groups contain the very same data in the same order,
> > > encoded with the same encoding and compressed with the same codec, I
> > > think they should be the same binary. I am not sure why you have
> > > different binary streams for these row groups, but if the proper data
> > > can be decoded from both row groups I would not spend too much time
> > > on it.
> > >
> > > About merging row groups: it is a tough issue, and far from as simple
> > > as concatenating the row groups (files) and creating a new footer.
> > > There are statistics in the footer that you have to take care of, as
> > > well as column indexes and bloom filters that are part of neither the
> > > footer nor the row groups. (They are written in separate data
> > > structures before the footer.)
> > > If you don't want to decode the row groups, these statistics can be
> > > updated (with the new offsets) and the new footer can be created by
> > > reading the original footers only. The problem here is that creating
> > > such a parquet file is not very useful in most cases. Most of the
> > > problems come from many small row groups (in small files), which
> > > cannot be solved this way. To solve the small files problem we need
> > > to merge the row groups, and for that we need to decode the original
> > > data so we can re-create the statistics (at least for bloom filters).
> > >
> > > Long story short: theoretically it is solvable, but it is a feature
> > > we haven't implemented properly so far.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <tall...@pic.es> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a developer of cosmohub.pic.es, a science platform that
> > > > provides interactive analysis and exploration of large scientific
> > > > datasets.
> > > > Working with Hive, users are able to generate the subset of data
> > > > they are interested in, and this result set is stored as a set of
> > > > files. When users want to download this dataset, we
> > > > combine/concatenate all the files on-the-fly to generate a single
> > > > stream that gets downloaded. Done right, this is very efficient,
> > > > avoids materializing the combined file, and the stream is even
> > > > seekable so downloads can be resumed. We are able to do this for
> > > > csv.bz2 and FITS formats.
> > > >
> > > > I am trying to do the same with parquet. Looking at the format
> > > > specification, it seems that it could be done by simply
> > > > concatenating the binary blobs of the set of row groups and
> > > > generating a new footer for the merged file. The problem is that
> > > > the same data, written twice in the same file (in two row groups),
> > > > is represented with some differences in the binary stream produced
> > > > (see attached image). Why is the binary representation of a row
> > > > group different if the data is the same? Is the order or position
> > > > of a row group codified inside its metadata?
> > > >
> > > > I attach the image of a parquet file with the same data (a single
> > > > integer column named 'c' with a single value 0) written twice, with
> > > > at least two differences marked in red and blue.
> > > > [image: image.png]
> > > >
> > > > A little diagram to show what I'm trying to accomplish:
> > > >
> > > > *contents of parquet file A:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > FOOTER A
> > > >
> > > > *contents of parquet file B:*
> > > > PAR1
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > FOOTER B
> > > >
> > > > If I'm not mistaken, there is no metadata in each row group that
> > > > refers to its file or its position, so they should be relocatable.
> > > > The final file/stream would look like this:
> > > >
> > > > *contents of combined parquet file:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > NEW FOOTER A+B
> > > >
> > > > Thanks a lot in advance for the help understanding this,
> > > >
> > > > Best regards,
> > > >
> > > > Pau.
> > > > --
> > > > ----------------------------------
> > > > Pau Tallada Crespí
> > > > Departament de Serveis
> > > > Port d'Informació Científica (PIC)
> > > > Tel: +34 93 170 2729
> > > > ----------------------------------