Hi,

I am a developer of cosmohub.pic.es, a science platform that provides interactive analysis and exploration of large scientific datasets. Working with Hive, users can generate the subset of data they are interested in, and the result set is stored as a set of files. When users want to download this dataset, we concatenate all the files on the fly into a single stream that gets downloaded. Done right, this is very efficient: it avoids materializing the combined file, and the stream is even seekable, so downloads can be resumed. We are able to do this for the csv.bz2 and FITS formats.
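To make the idea concrete, here is a minimal sketch of such a seekable, on-the-fly concatenation. It is only an illustration under simplifying assumptions, not our production code: the class name `ConcatenatedStream` is hypothetical, the parts are assumed to be seekable binary file objects, and only absolute seeks are supported.

```python
import bisect
import io
from itertools import accumulate


class ConcatenatedStream:
    """Read-only, seekable view over several binary parts laid end to end."""

    def __init__(self, parts):
        # parts: seekable binary file objects (hypothetical inputs)
        self._parts = parts
        sizes = [p.seek(0, io.SEEK_END) for p in parts]
        # _ends[i] is the global offset just past the end of part i
        self._ends = list(accumulate(sizes))
        self._pos = 0

    @property
    def size(self):
        return self._ends[-1] if self._ends else 0

    def seek(self, offset):
        # Absolute positioning only, clamped to the valid range
        self._pos = max(0, min(offset, self.size))
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self.size - self._pos
        chunks = []
        while n > 0 and self._pos < self.size:
            # Find the part that contains the current global offset
            i = bisect.bisect_right(self._ends, self._pos)
            start = self._ends[i - 1] if i else 0
            part = self._parts[i]
            part.seek(self._pos - start)
            data = part.read(min(n, self._ends[i] - self._pos))
            if not data:
                break
            chunks.append(data)
            self._pos += len(data)
            n -= len(data)
        return b"".join(chunks)


# Usage: three in-memory parts standing in for the result files
stream = ConcatenatedStream(
    [io.BytesIO(b"abc"), io.BytesIO(b"de"), io.BytesIO(b"fghi")]
)
whole = stream.read()
stream.seek(2)
middle = stream.read(4)
```

Because the combined file is never written anywhere, resuming a download only requires seeking to the requested offset and reading from there.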
I am trying to do the same with Parquet. Looking at the format specification, it seems that it could be done by simply concatenating the binary blobs of the row groups and generating a new footer for the merged file. The problem is that the same data, written twice in the same file (in two row groups), produces slightly different bytes in the binary stream (see attached image).

Why is the binary representation of a row group different if the data is the same? Is the order or position of a row group encoded inside its metadata?

I attach an image of a Parquet file with the same data (a single integer column named 'c' with a single value 0) written twice, with at least two differences marked in red and blue.

[image: image.png]

A little diagram to show what I'm trying to accomplish:

*contents of parquet file A:*
PAR1
ROW GROUP A1
ROW GROUP A2
FOOTER A

*contents of parquet file B:*
PAR1
ROW GROUP B1
ROW GROUP B2
FOOTER B

If I'm not mistaken, there is no metadata in each row group that refers to its file or its position, so they should be relocatable. The final file/stream would look like this:

*contents of combined parquet file:*
PAR1
ROW GROUP A1
ROW GROUP A2
ROW GROUP B1
ROW GROUP B2
NEW FOOTER A+B

Thanks a lot in advance for the help understanding this.

Best regards,

Pau.

--
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------