Hi David,
This is one solution for consolidating parquet files, but as long as one is
rewriting them anyway, consolidating small row groups could also make sense.
It is also worth noting:
1.  I don't think this is what the original poster was looking for since
reading the files involves decompression and decoding.
2.  There isn't a guarantee that encoding and compression will stay the
same between different versions/implementations of parquet (e.g. I think
there are different thresholds for dictionary encoding).
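
As a footnote on why plain byte concatenation does not work (the
single-footer point further down the thread), here is a minimal Python
sketch. It only models the file framing, which as far as I know is: a
leading PAR1 magic, the row group bytes, the footer metadata, a 4-byte
little-endian footer length, and a trailing PAR1. The `fake_file` helper and
the `<A1>`-style placeholder bytes are made up for illustration; they are not
real Thrift metadata or real column data.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def locate_footer(data: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the footer a reader would find."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic")
    # The 4 bytes before the trailing magic hold the footer length
    # as a little-endian unsigned integer.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


def fake_file(row_groups: bytes, footer: bytes) -> bytes:
    """Hypothetical helper: wrap placeholder bytes in Parquet-style framing."""
    return MAGIC + row_groups + footer + struct.pack("<I", len(footer)) + MAGIC


file_a = fake_file(b"<A1><A2>", b"<FOOTER A>")
file_b = fake_file(b"<B1><B2>", b"<FOOTER B>")

# Naive merge: just append the bytes of B to the bytes of A.
naive = file_a + file_b

start, length = locate_footer(naive)
found = naive[start:start + length]
print(found)  # only the last footer is discoverable

# Inside file B the row groups start at offset 4. In the merged bytes,
# offset 4 is file A's data, so offsets stored in FOOTER B are stale.
print(naive[4:12], naive[34:42])
```

A reader starts from the end of the file, so in the naive concatenation only
FOOTER B is found, and any offsets it stores were computed relative to the
start of file B rather than the merged file.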

On Friday, October 15, 2021, Lee, David <david....@blackrock.com> wrote:

> Here was my solution back in 2018. It's easier to do now with pyarrow's
> Python APIs than with Spark.
>
>
> https://stackoverflow.com/questions/39187622/how-do-you-control-the-size-of-the-output-file/51216145#51216145
>
> Read all the smaller files in your list one at a time and write them to
> the temp file as a parquet ROW GROUP. It is very important to write each file
> in as a row group, which preserves compression encoding and guarantees that
> the number of bytes (minus schema metadata) written will be the same as the
> original file size.
>
> -----Original Message-----
> From: Lee, David
> Sent: Friday, October 15, 2021 2:04 PM
> To: dev@parquet.apache.org; 'emkornfi...@gmail.com' <emkornfi...@gmail.com>;
> david....@blackrock.com.invalid
> Subject: RE: Concatenation of parquet files
>
> Well, this is right and wrong. There is one footer, but the statistics are
> captured per row group, which allows row groups to be easily concatenated
> into a new file without rebuilding column stats.
>
> The final file looks more like:
>
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > FOOTER A1, A2, B1, B2
>
>
> http://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/
>
> When all the row groups are written and before closing the file, the
> Parquet writer adds the footer to the end of the file.
>
> The footer includes the file schema (column names and their types) as well
> as details about every row group (total size, number of rows, min/max
> statistics, number of NULL values for every column).
>
> Note that these column statistics are per row group, not for the entire file.
>
> -----Original Message-----
> From: Micah Kornfield <emkornfi...@gmail.com>
> Sent: Friday, October 15, 2021 1:40 PM
> To: david....@blackrock.com.invalid
> Cc: dev@parquet.apache.org
> Subject: Re: Concatenation of parquet files
>
> Hi David,
> I'm not sure I understand.  Concatenating files like this would likely
> break things.  In particular in the example:
>
>
> > Merged:
> > > > ROW GROUP A1
> > > > FOOTER A1
> > > > ROW GROUP A2
> > > > FOOTER A2
> > > > ROW GROUP B1
> > > > FOOTER B1
> > > > ROW GROUP B2
> > > > FOOTER B2
>
>
> There should only be one footer per file; otherwise, I don't think there
> is any means of discovering the A row groups.  Also, without rewriting the
> metadata, the file offsets of B would be wrong (
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790
> ).
>
>
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> > "We can similarly write a Parquet file with multiple row groups by
> > using ParquetWriter"
>
>
> Multiple row groups are fine.  Combining them after the fact by simple
> file concatenation (which is what I understand the original question to be)
> would yield incorrect results.  If you reread small files and write them
> out again in one pass, that would be fine.
>
> Cheers,
> Micah
>
> On Fri, Oct 15, 2021 at 1:29 PM Lee, David <david....@blackrock.com.invalid>
> wrote:
>
> > Each row group should have its own statistics footer or dictionary..
> > Your file structure should look like this:
> >
> > > > *contents of parquet file A:*
> > > > ROW GROUP A1
> > > > FOOTER A1
> > > > ROW GROUP A2
> > > > FOOTER A2
> > > >
> > > > *contents of parquet file B:*
> > > > ROW GROUP B1
> > > > FOOTER B1
> > > > ROW GROUP B2
> > > > FOOTER B2
> >
> > Merged:
> > > > ROW GROUP A1
> > > > FOOTER A1
> > > > ROW GROUP A2
> > > > FOOTER A2
> > > > ROW GROUP B1
> > > > FOOTER B1
> > > > ROW GROUP B2
> > > > FOOTER B2
> >
> > I frequently concatenate smaller parquet files by appending rowgroups
> > until I hit an optimal 125 meg file size for HDFS.
> >
> >
> > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> > "We can similarly write a Parquet file with multiple row groups by
> > using ParquetWriter"
> >
> > -----Original Message-----
> > From: Pau Tallada <tall...@pic.es>
> > Sent: Tuesday, September 14, 2021 6:01 AM
> > To: dev@parquet.apache.org
> > Subject: Re: Concatenation of parquet files
> >
> > Dear Gabor,
> >
> > Thanks a lot for the clarification! ☺
> > I understand this is not a common use case, I somewhat just had hope
> > it could be done easily :P
> >
> > If you are interested, I attach a Colab notebook that shows this
> > behaviour. The same data written three times produces different binary
> > contents.
> >
> > https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
> >
> > Thanks again and best regards,
> >
> > Pau
> >
> > On Tue, Sep 14, 2021 at 10:54, Gabor Szadovszky <ga...@apache.org>
> > wrote:
> >
> > > Hi Pau,
> > >
> > > I guess attachments are not allowed in the apache lists so we cannot
> > > see the image.
> > >
> > > > If the two row groups contain the very same data in the same order,
> > > > encoded with the same encoding and compressed with the same codec, I
> > > > think they should be the same binary. I am not sure why you have
> > > > different binary streams for these row groups, but if the proper data
> > > > can be decoded from both row groups I would not spend too much time
> > > > on it.
> > >
> > > > About merging row groups: it is a tough issue and far from as
> > > > simple as concatenating the row groups (files) and creating a new
> > > > footer. There are statistics in the footer that you have to take
> > > > care of, as well as column indexes and bloom filters, which are part
> > > > of neither the footer nor the row groups. (They are written in
> > > > separate data structures before the footer.)
> > > > If you don't want to decode the row groups, these statistics can be
> > > > updated (with the new offsets) and the new footer can be created by
> > > > reading only the original footers. The problem here is that
> > > > creating such a parquet file is not very useful in most cases. Most
> > > > of the problems come from many small row groups (in small files),
> > > > which cannot be solved this way. To solve the small files problem we
> > > > need to merge the row groups, and for that we need to decode the
> > > > original data so we can re-create the statistics (at least for bloom
> > > > filters).
> > >
> > > Long story short, theoretically it is solvable but it is a feature
> > > we haven't implemented properly so far.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <tall...@pic.es> wrote:
> > >
> > > > Hi,
> > > >
> > > > > I am a developer of cosmohub.pic.es, a science platform that
> > > > > provides interactive analysis and exploration of large scientific
> > > > > datasets. Working with Hive, users are able to generate the subset
> > > > > of data they are interested in, and this result set is stored as a
> > > > > set of files. When users want to download this dataset, we
> > > > > combine/concatenate all the files on-the-fly to generate a single
> > > > > stream that gets downloaded. Done right, this is very efficient,
> > > > > avoids materializing the combined file and the stream is even
> > > > > seekable so downloads can be resumed. We are able to do this for
> > > > > csv.bz2 and FITS formats.
> > > >
> > > > I am trying to do the same with parquet. Looking at the format
> > > > specification, it seems that it could be done by simply
> > > > concatenating the binary blobs of the set of row groups and
> > > > generating a new footer for the merged file. The problem is that
> > > > the same data, written twice in the same file (in two row groups),
> > > > is represented with some differences in the binary stream produced
> > > > (see attached image). Why is the binary representation of a row
> > > > group different if the data is the same? Is the order or position
> > > > > of a row group codified inside its metadata?
> > > >
> > > > I attach the image of a parquet file with the same data (a single
> > > > integer column named 'c' with a single value 0) written twice,
> > > > with at least two differences marked in red and blue.
> > > > [image: image.png]
> > > >
> > > >
> > > > A little diagram to show what I'm trying to accomplish:
> > > >
> > > > *contents of parquet file A:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > FOOTER A
> > > >
> > > > *contents of parquet file B:*
> > > > PAR1
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > FOOTER B
> > > >
> > > > > If I'm not mistaken, there is no metadata in each row group that
> > > > > refers to its file or its position, so they should be relocatable.
> > > > > The final file/stream would look like this:
> > > >
> > > > *contents of combined parquet file:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > NEW FOOTER A+B
> > > >
> > > > Thanks a lot in advance for the help understanding this,
> > > >
> > > > Best regards,
> > > >
> > > > Pau.
> > > > --
> > > > ----------------------------------
> > > > Pau Tallada Crespí
> > > > Departament de Serveis
> > > > Port d'Informació Científica (PIC)
> > > > Tel: +34 93 170 2729
> > > > ----------------------------------
> > > >
> > > >
> > >
> >
> >
>
