Hi Pau,

I guess attachments are not allowed on the Apache lists, so we cannot see
the image.

If the two row groups contain the very same data in the same order,
encoded with the same encoding and compressed with the same codec, I think
they should be binary identical. I am not sure why you get different binary
streams for these row groups, but if the proper data can be decoded from
both row groups I would not spend too much time on it.

About merging row groups: it is a tough issue and far from as simple as
concatenating the row groups (files) and creating a new footer. There are
statistics in the footer that you have to take care of, as well as column
indexes and bloom filters, which are part of neither the footer nor the
row groups. (They are written in separate data structures before the
footer.)
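
For orientation, here is a minimal sketch of how the footer is located at the
end of a file. Per the format specification, a Parquet file ends with the
serialized footer, a 4-byte little-endian footer length, and the magic bytes
PAR1. The name `locate_footer` is made up for illustration, and the footer
bytes in the example are a placeholder, not real metadata.

```python
import struct

MAGIC = b"PAR1"

def locate_footer(data: bytes) -> tuple:
    """Return (footer_start, footer_length) for a whole file held in memory."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length is larger than the file")
    return footer_start, footer_len

# Placeholder file: magic, 10 bytes standing in for row groups,
# a 5-byte fake footer, the footer length, and the trailing magic.
fake = MAGIC + b"\x00" * 10 + b"FOOTR" + struct.pack("<I", 5) + MAGIC
footer_start, footer_len = locate_footer(fake)
```

Everything between the leading magic and `footer_start` is row group data plus
the separate index and bloom filter structures mentioned above.
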
If you don't want to decode the row groups, these statistics can be updated
(with the new offsets) and the new footer can be created by reading
the original footers only. The problem is that creating such a parquet file
is not very useful in most cases. Most of the problems come from many small
row groups (in small files), which cannot be solved this way. To solve the
small files problem we need to merge the row groups themselves, and for that
we need to decode the original data so we can re-create the statistics (at
least for bloom filters).
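
The offset bookkeeping for the concatenation-only approach can be sketched
roughly as follows. This is plain Python over made-up byte ranges, not
parquet-mr code: `RowGroup`, `relocate`, and the example numbers are
hypothetical. A real implementation would have to rewrite every offset field
stored in the footer metadata and deal with column indexes and bloom filters,
which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List

MAGIC_LEN = 4  # every file starts with the 4-byte magic "PAR1"

@dataclass
class RowGroup:
    offset: int  # absolute position of the row group in its file
    length: int  # total size of the row group in bytes

def relocate(files: List[List[RowGroup]]) -> List[RowGroup]:
    """Compute where each row group lands when the row groups of all
    input files are written back to back after one leading magic."""
    merged = []
    position = MAGIC_LEN
    for row_groups in files:
        # Keep the original order of row groups within each file.
        for rg in sorted(row_groups, key=lambda r: r.offset):
            merged.append(RowGroup(offset=position, length=rg.length))
            position += rg.length
    return merged

# Hypothetical layouts for files A and B (two row groups each).
file_a = [RowGroup(4, 100), RowGroup(104, 50)]
file_b = [RowGroup(4, 70), RowGroup(74, 30)]
merged = relocate([file_a, file_b])
```

The difference between a row group's new and old offset is the shift that
would have to be applied to every absolute offset recorded for it in the new
footer.
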

Long story short, theoretically it is solvable but it is a feature we
haven't implemented properly so far.

Cheers,
Gabor

On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <tall...@pic.es> wrote:

> Hi,
>
> I am a developer of cosmohub.pic.es, a science platform that provides
> interactive analysis and exploration of large scientific datasets. Working
> with Hive, users are able to generate the subset of data they are
> interested in, and this result set is stored as a set of files. When users
> want to download this dataset, we combine/concatenate all the files
> on-the-fly to generate a single stream that gets downloaded. Done right,
> this is very efficient, avoids materializing the combined file and the
> stream is even seekable so downloads can be resumed. We are able to do this
> for csv.bz2 and FITS formats.
>
> I am trying to do the same with parquet. Looking at the format
> specification, it seems that it could be done by simply concatenating the
> binary blobs of the set of row groups and generating a new footer for the
> merged file. The problem is that the same data, written twice in the same
> file (in two row groups), is represented with some differences in the
> binary stream produced (see attached image). Why is the binary
> representation of a row group different if the data is the same? Is the
> order or position of a row group codified inside its metadata?
>
> I attach the image of a parquet file with the same data (a single integer
> column named 'c' with a single value 0) written twice, with at least two
> differences marked in red and blue.
> [image: image.png]
>
>
> A little diagram to show what I'm trying to accomplish:
>
> *contents of parquet file A:*
> PAR1
> ROW GROUP A1
> ROW GROUP A2
> FOOTER A
>
> *contents of parquet file B:*
> PAR1
> ROW GROUP B1
> ROW GROUP B2
> FOOTER B
>
> If I'm not mistaken, there is no metadata in each row group that refers to
> its file or its position, so they should be relocatable. The final
> file/stream would look like this:
>
> *contents of combined parquet file:*
> PAR1
> ROW GROUP A1
> ROW GROUP A2
> ROW GROUP B1
> ROW GROUP B2
> NEW FOOTER A+B
>
> Thanks a lot in advance for the help understanding this,
>
> Best regards,
>
> Pau.
> --
> ----------------------------------
> Pau Tallada Crespí
> Departament de Serveis
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> ----------------------------------
>
>
