Hi,

I am a developer of cosmohub.pic.es, a science platform that provides
interactive analysis and exploration of large scientific datasets. Working
with Hive, users are able to generate the subset of data they are
interested in, and this result set is stored as a set of files. When users
want to download this dataset, we combine/concatenate all the files
on-the-fly to generate a single stream that gets downloaded. Done right,
this is very efficient: it avoids materializing the combined file, and the
stream is even seekable, so downloads can be resumed. We are able to do this
for csv.bz2 and FITS formats.
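
To make the idea concrete, here is a minimal sketch (not our production
code) of how a resume position in the combined stream can be mapped back to
one of the underlying files. The names `locate` and `stream_from` are
hypothetical, and the parts are plain in-memory byte strings for simplicity.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(sizes, position):
    """Map a byte position in the combined stream to (part index, offset in that part)."""
    ends = list(accumulate(sizes))  # cumulative end offset of each part
    if not ends or not 0 <= position < ends[-1]:
        raise ValueError("position outside the combined stream")
    index = bisect_right(ends, position)  # first part that ends after `position`
    start = ends[index] - sizes[index]    # where that part begins in the stream
    return index, position - start


def stream_from(parts, position):
    """Yield the concatenated bytes of `parts`, starting at `position`."""
    index, offset = locate([len(p) for p in parts], position)
    yield parts[index][offset:]           # tail of the part containing `position`
    for part in parts[index + 1:]:        # every later part, unchanged
        yield part
```

A resumed download would call `stream_from(parts, position)` with the byte
position the client asks for, so nothing before that position has to be read.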

I am trying to do the same with parquet. Looking at the format
specification, it seems that it could be done by simply concatenating the
binary blobs of the set of row groups and generating a new footer for the
merged file. The problem is that the same data, written twice in the same
file (as two row groups), does not produce identical bytes: the two row
groups differ slightly in the binary stream (see attached image). Why is
the binary representation of a row group different if the data is the
same? Is the order or position of a row group encoded inside its metadata?

I attach an image of a parquet file with the same data (a single integer
column named 'c' with a single value 0) written twice, with at least two
differences marked in red and blue.
[image: image.png]


A little diagram to show what I'm trying to accomplish:

*contents of parquet file A:*
PAR1
ROW GROUP A1
ROW GROUP A2
FOOTER A

*contents of parquet file B:*
PAR1
ROW GROUP B1
ROW GROUP B2
FOOTER B

If I'm not mistaken, there is no metadata in each row group that refers to
its file or its position, so they should be relocatable. The final
file/stream would look like this:

*contents of combined parquet file:*
PAR1
ROW GROUP A1
ROW GROUP A2
ROW GROUP B1
ROW GROUP B2
NEW FOOTER A+B
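
The layout above can be sketched as a small planning step. This is only an
illustration under the assumption that row groups really are relocatable;
the function name `plan_combined_layout` and the (offset, length) ranges
are hypothetical and would have to come from each source file's footer.

```python
MAGIC = b"PAR1"  # leading magic bytes of a parquet file


def plan_combined_layout(files):
    """Plan where each source row group lands in the combined stream.

    `files` is a list of (name, [(offset, length), ...]) pairs, one byte
    range per row group (assumed to be taken from each file's footer).
    Returns (plan, footer_start): `plan` has one entry per row group and
    `footer_start` is where the new footer would begin.
    """
    plan = []
    cursor = len(MAGIC)  # the combined stream starts with a single PAR1
    for name, row_groups in files:
        for src_offset, length in row_groups:
            plan.append({
                "source": name,
                "src_offset": src_offset,
                "length": length,
                "dst_offset": cursor,
                # amount to add to any absolute offset the new footer
                # records for this row group
                "shift": cursor - src_offset,
            })
            cursor += length
    return plan, cursor
```

For example, with two files whose row groups occupy bytes 4-103 and 104-153
(A) and 4-83 and 84-103 (B), the row groups of B would be shifted by 150
bytes and the new footer would start at byte 254. The `shift` value is the
part I would expect the new footer to need, since the bytes of the row
groups themselves are copied untouched.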

Thanks a lot in advance for the help understanding this,

Best regards,

Pau.
-- 
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------