RE: Concatenation of parquet files

Lee, David Fri, 15 Oct 2021 14:04:42 -0700

Well, this is both right and wrong. There is one footer, but the statistics are 
captured per row group, which allows row groups to be easily concatenated into a 
new file without rebuilding column stats.


The final file looks more like:

> > > ROW GROUP A1
> > > ROW GROUP A2
> > > ROW GROUP B1
> > > ROW GROUP B2
> > > FOOTER A1, A2, B1, B2

http://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/

When all the row groups are written, and before closing the file, the Parquet 
writer adds the footer to the end of the file.

The footer includes the file schema (column names and their types) as well as 
details about every row group (total size, number of rows, min/max statistics, 
number of NULL values for every column). 

Note that these column statistics are per row group, not for the entire file.

-----Original Message-----
From: Micah Kornfield <emkornfi...@gmail.com> 
Sent: Friday, October 15, 2021 1:40 PM
To: david....@blackrock.com.invalid
Cc: dev@parquet.apache.org
Subject: Re: Concatenation of parquet files


Hi David,
I'm not sure I understand.  Concatenating files like this would likely break 
things.  In particular in the example:


> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2


There should only be one footer per file; otherwise, I don't think there is any 
means of discovering the A row groups.  Also, without rewriting the metadata, 
the file offsets of B would be wrong 
(https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790).
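
A toy illustration of the discovery problem, using only the outer framing of a Parquet file (leading `PAR1`, then data, then footer, then a 4-byte little-endian footer length, then trailing `PAR1`). The payloads below are placeholder bytes, not real Thrift-encoded metadata, and `toy_file` / `find_footer` are made-up names for this sketch:

```python
import struct

MAGIC = b"PAR1"


def toy_file(row_groups, footer):
    """Build bytes with Parquet's outer framing only (placeholder payloads)."""
    body = b"".join(row_groups)
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def find_footer(data):
    """Locate the footer the way a reader does: from the tail of the file."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - footer_len
    return data[start:start + footer_len]


file_a = toy_file([b"<A1>", b"<A2>"], b"footer-A")
file_b = toy_file([b"<B1>", b"<B2>"], b"footer-B")

# Naive byte concatenation: only the last footer is reachable from the tail,
# and any absolute offset stored in it is now off by len(file_a) bytes.
naive = file_a + file_b
print(find_footer(naive))  # b'footer-B'
```

Nothing at the tail of the concatenated bytes points back at footer A, so the A row groups are invisible to a reader that starts from the end.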

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by 
> using ParquetWriter"


Multiple row groups are fine.  Combining them after the fact by simple file 
concatenation (which is what I understand the original question to be) would 
yield incorrect results.  If you reread small files and write them out again in 
one pass, that would be fine.

Cheers,
Micah

On Fri, Oct 15, 2021 at 1:29 PM Lee, David <david....@blackrock.com.invalid>
wrote:

> Each row group should have its own statistics footer or dictionary. 
> Your file structure should look like this:
>
> > > *contents of parquet file A:*
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > >
> > > *contents of parquet file B:*
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> I frequently concatenate smaller parquet files by appending rowgroups 
> until I hit an optimal 125 meg file size for HDFS.
>
>
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by 
> using ParquetWriter"
>
> -----Original Message-----
> From: Pau Tallada <tall...@pic.es>
> Sent: Tuesday, September 14, 2021 6:01 AM
> To: dev@parquet.apache.org
> Subject: Re: Concatenation of parquet files
>
>
> Dear Gabor,
>
> Thanks a lot for the clarification! ☺
> I understand this is not a common use case, I somewhat just had hope 
> it could be done easily :P
>
> If you are interested, I attach a Colab notebook that shows this 
> behaviour. The same data written three times produces different binary 
> contents.
>
> https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
>
> Thanks again and best regards,
>
> Pau
>
> On Tue, 14 Sep 2021 at 10:54, Gabor Szadovszky <ga...@apache.org> wrote:
>
> > Hi Pau,
> >
> > I guess attachments are not allowed in the apache lists so we cannot 
> > see the image.
> >
> > If the two row groups contain the very same data in the same order, 
> > encoded with the same encoding and compressed with the same codec, I 
> > think they should be the same binary. I am not sure why you have 
> > different binary streams for these row groups, but if the proper data 
> > can be decoded from both row groups I would not spend too much time 
> > on it.
> >
> > About merging row groups: it is a tough issue and far from as simple 
> > as concatenating the row groups (files) and creating a new footer. 
> > There are statistics in the footer that you have to take care of, as 
> > well as column indexes and bloom filters that are part of neither the 
> > footer nor the row groups. (They are written in separate data 
> > structures before the footer.)
> > If you don't want to decode the row groups, these statistics can be 
> > updated (with the new offsets) and the new footer can be created by 
> > reading the original footers only. The problem is that creating such 
> > a parquet file is not very useful in most cases. Most of the problems 
> > come from many small row groups (in small files), which cannot be 
> > solved this way. To solve the small files problem we need to merge 
> > the row groups, and for that we need to decode the original data so 
> > we can re-create the statistics (at least for bloom filters).
> >
> > Long story short, theoretically it is solvable but it is a feature 
> > we haven't implemented properly so far.
> >
> > Cheers,
> > Gabor
> >
> > On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <tall...@pic.es> wrote:
> >
> > > Hi,
> > >
> > > I am a developer of cosmohub.pic.es, a science platform that 
> > > provides interactive analysis and exploration of large scientific 
> > > datasets. Working with Hive, users are able to generate the subset 
> > > of data they are interested in, and this result set is stored as a 
> > > set of files. When users want to download this dataset, we 
> > > combine/concatenate all the files on-the-fly to generate a single 
> > > stream that gets downloaded. Done right, this is very efficient, 
> > > avoids materializing the combined file and the stream is even 
> > > seekable so downloads can be resumed. We are able to do this for 
> > > csv.bz2 and FITS formats.
> > >
> > > I am trying to do the same with parquet. Looking at the format 
> > > specification, it seems that it could be done by simply 
> > > concatenating the binary blobs of the set of row groups and 
> > > generating a new footer for the merged file. The problem is that 
> > > the same data, written twice in the same file (in two row groups), 
> > > is represented with some differences in the binary stream produced 
> > > (see attached image). Why is the binary representation of a row 
> > > group different if the data is the same? Is the order or position 
> > > of a row group codified inside its metadata?
> > >
> > > I attach the image of a parquet file with the same data (a single 
> > > integer column named 'c' with a single value 0) written twice, 
> > > with at least two differences marked in red and blue.
> > > [image: image.png]
> > >
> > >
> > > A little diagram to show what I'm trying to accomplish:
> > >
> > > *contents of parquet file A:*
> > > PAR1
> > > ROW GROUP A1
> > > ROW GROUP A2
> > > FOOTER A
> > >
> > > *contents of parquet file B:*
> > > PAR1
> > > ROW GROUP B1
> > > ROW GROUP B2
> > > FOOTER B
> > >
> > > If I'm not mistaken, there is no metadata in each row group that 
> > > refers to its file or its position, so they should be relocatable. 
> > > The final file/stream would look like this:
> > >
> > > *contents of combined parquet file:*
> > > PAR1
> > > ROW GROUP A1
> > > ROW GROUP A2
> > > ROW GROUP B1
> > > ROW GROUP B2
> > > NEW FOOTER A+B
> > >
> > > Thanks a lot in advance for the help understanding this,
> > >
> > > Best regards,
> > >
> > > Pau.
> > > --
> > > ----------------------------------
> > > Pau Tallada Crespí
> > > Departament de Serveis
> > > Port d'Informació Científica (PIC)
> > > Tel: +34 93 170 2729
> > > ----------------------------------
> > >
> > >
> >
>
>
