Dear Gabor,

Thanks a lot for the clarification! ☺ I understand this is not a common use case; I was just hoping it could be done easily :P
If you are interested, I attach a Colab notebook that shows this behaviour. The same data written three times produces different binary contents.

https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing

Thanks again and best regards,

Pau

Message from Gabor Szadovszky <ga...@apache.org> on Tue, 14 Sep 2021 at 10:54:

> Hi Pau,
>
> I guess attachments are not allowed in the apache lists so we cannot see
> the image.
>
> If the two row groups contain the very same data in the same order and
> encoded with the same encoding, compressed with the same codec I think,
> they should be the same binary. I am not sure why you have different binary
> streams for these row groups but if the proper data can be decoded from
> both row groups I would not spend too much time on it.
>
> About merging row groups. It is a tough issue and far not that simple as
> concatenating the row groups (files) and creating a new footer. There are
> statistics in the footer that you have to take care about as well as column
> indexes and bloom filters that are not part of the footer and neither the
> row groups. (They are written in separate data structures before the
> footer.)
> If you don't want to decode the row groups these statistics can be updated
> (with the new offsets) as well as the new footer can be created by reading
> the original footers only. The problem here is creating such a parquet file
> is not very useful in most cases. Most of the problems come from many small
> row groups (in small files) which cannot be solved this way. To solve the
> small files problem we need to merge the row groups and for that we need to
> decode the original data so we can re-create the statistics (at least for
> bloom filters).
>
> Long story short, theoretically it is solvable but it is a feature we
> haven't implemented properly so far.
>
> Cheers,
> Gabor
>
> On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <tall...@pic.es> wrote:
>
> > Hi,
> >
> > I am a developer of cosmohub.pic.es, a science platform that provides
> > interactive analysis and exploration of large scientific datasets. Working
> > with Hive, users are able to generate the subset of data they are
> > interested in, and this result set is stored as a set of files. When users
> > want to download this dataset, we combine/concatenate all the files
> > on-the-fly to generate a single stream that gets downloaded. Done right,
> > this is very efficient, avoids materializing the combined file and the
> > stream is even seekable so downloads can be resumed. We are able to do this
> > for csv.bz2 and FITS formats.
> >
> > I am trying to do the same with parquet. Looking at the format
> > specification, it seems that it could be done by simply concatenating the
> > binary blobs of the set of row groups and generating a new footer for the
> > merged file. The problem is that the same data, written twice in the same
> > file (in two row groups), is represented with some differences in the
> > binary stream produced (see attached image). Why is the binary
> > representation of a row group different if the data is the same? Is the
> > order or position of a row group codified inside its metadata?
> >
> > I attach the image of a parquet file with the same data (a single integer
> > column named 'c' with a single value 0) written twice, with at least two
> > differences marked in red and blue.
> > [image: image.png]
> >
> >
> > A little diagram to show what I'm trying to accomplish:
> >
> > *contents of parquet file A:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > FOOTER A
> >
> > *contents of parquet file B:*
> > PAR1
> > ROW GROUP B1
> > ROW GROUP B2
> > FOOTER B
> >
> > If I'm not mistaken, there is no metadata in each row group that refers to
> > its file or its position, so they should be relocatable.
> > The final file/stream would look like this:
> >
> > *contents of combined parquet file:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > ROW GROUP B1
> > ROW GROUP B2
> > NEW FOOTER A+B
> >
> > Thanks a lot in advance for the help understanding this,
> >
> > Best regards,
> >
> > Pau.
> > --
> > ----------------------------------
> > Pau Tallada Crespí
> > Departament de Serveis
> > Port d'Informació Científica (PIC)
> > Tel: +34 93 170 2729
> > ----------------------------------
> >
>

--
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------
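
A minimal sketch of the byte-level bookkeeping behind the diagram in the quoted message, in Python with only the standard library. It relies on the Parquet file layout (the 4-byte magic PAR1, the data, the footer, a 4-byte little-endian footer length, and PAR1 again). The helper names split_parquet, plan_concatenation and fake_file are hypothetical, and the "files" are dummy byte strings, not real Parquet data. Rewriting the Thrift-encoded footer with the shifted offsets and merged statistics is not shown, and the region called "body" here would also hold any column indexes and bloom filters, since those are written before the footer.

```python
import struct

MAGIC = b"PAR1"


def split_parquet(buf: bytes):
    """Split a serialized file into (body, footer).

    Assumed layout: MAGIC | body | footer | 4-byte LE footer length | MAGIC
    """
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    footer_start = len(buf) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length larger than file")
    return buf[4:footer_start], buf[footer_start:-8]


def plan_concatenation(files):
    """Return one (shift, body) pair per input file.

    `shift` is the amount to add to every absolute offset stored in that
    file's footer once its body is relocated into the combined stream.
    """
    plan = []
    position = len(MAGIC)  # the combined stream also starts with PAR1
    for buf in files:
        body, _footer = split_parquet(buf)
        # The body started at offset 4 in its own file; it now starts
        # at `position` in the combined stream.
        plan.append((position - len(MAGIC), body))
        position += len(body)
    return plan


def fake_file(body: bytes, footer: bytes) -> bytes:
    """Build a dummy file with the Parquet framing (illustration only)."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


file_a = fake_file(b"A1A1A2", b"footer-A")
file_b = fake_file(b"B1B2B2B2", b"footer-B")

plan = plan_concatenation([file_a, file_b])
shifts = [shift for shift, _ in plan]
merged_body = b"".join(body for _, body in plan)
print(shifts, merged_body)
```

The first file keeps its offsets (shift 0) and every later file is shifted by the total length of the bodies before it. Because the shifts depend only on the footers, they can be computed without decoding any row group, which is the case described in the reply for updating offsets from the original footers only.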