Re: Concatenation of parquet files

Pau Tallada Tue, 14 Sep 2021 06:01:17 -0700

Dear Gabor,

Thanks a lot for the clarification! ☺
I understand this is not a common use case; I just hoped it could be done
easily :P


If you are interested, here is a Colab notebook that shows this
behaviour. The same data written three times produces different binary
contents.
https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing

Thanks again and best regards,

Pau

Message from Gabor Szadovszky <ga...@apache.org> on Tue, 14 Sep 2021
at 10:54:

> Hi Pau,
>
> I guess attachments are not allowed in the apache lists so we cannot see
> the image.
>
> If the two row groups contain the very same data in the same order,
> encoded with the same encoding and compressed with the same codec, I think
> they should be binary-identical. I am not sure why you have different binary
> streams for these row groups, but if the proper data can be decoded from
> both row groups I would not spend too much time on it.
>
> About merging row groups: it is a tough issue and not nearly as simple as
> concatenating the row groups (files) and creating a new footer. There are
> statistics in the footer that you have to take care of, as well as column
> indexes and bloom filters that are part of neither the footer nor the
> row groups. (They are written in separate data structures before the
> footer.)
> If you don't want to decode the row groups, these statistics can be updated
> (with the new offsets) and the new footer can be created by reading
> the original footers only. The problem is that creating such a Parquet file
> is not very useful in most cases. Most of the problems come from many small
> row groups (in small files), which cannot be solved this way. To solve the
> small files problem we need to merge the row groups, and for that we need to
> decode the original data so we can re-create the statistics (at least for
> bloom filters).
>
> Long story short, theoretically it is solvable but it is a feature we
> haven't implemented properly so far.
>
> Cheers,
> Gabor
>
> On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <tall...@pic.es> wrote:
>
> > Hi,
> >
> > I am a developer of cosmohub.pic.es, a science platform that provides
> > interactive analysis and exploration of large scientific datasets. Working
> > with Hive, users are able to generate the subset of data they are
> > interested in, and this result set is stored as a set of files. When users
> > want to download this dataset, we combine/concatenate all the files
> > on-the-fly to generate a single stream that gets downloaded. Done right,
> > this is very efficient, avoids materializing the combined file and the
> > stream is even seekable so downloads can be resumed. We are able to do this
> > for csv.bz2 and FITS formats.
> >
> > I am trying to do the same with parquet. Looking at the format
> > specification, it seems that it could be done by simply concatenating the
> > binary blobs of the set of row groups and generating a new footer for the
> > merged file. The problem is that the same data, written twice in the same
> > file (in two row groups), is represented with some differences in the
> > binary stream produced (see attached image). Why is the binary
> > representation of a row group different if the data is the same? Is the
> > order or position of a row group encoded inside its metadata?
> >
> > I attach the image of a parquet file with the same data (a single integer
> > column named 'c' with a single value 0) written twice, with at least two
> > differences marked in red and blue.
> > [image: image.png]
> >
> >
> > A little diagram to show what I'm trying to accomplish:
> >
> > *contents of parquet file A:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > FOOTER A
> >
> > *contents of parquet file B:*
> > PAR1
> > ROW GROUP B1
> > ROW GROUP B2
> > FOOTER B
> >
> > If I'm not mistaken, there is no metadata in each row group that refers to
> > its file or its position, so they should be relocatable. The final
> > file/stream would look like this:
> >
> > *contents of combined parquet file:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > ROW GROUP B1
> > ROW GROUP B2
> > NEW FOOTER A+B
> >
> > Thanks a lot in advance for the help understanding this,
> >
> > Best regards,
> >
> > Pau.
> > --
> > ----------------------------------
> > Pau Tallada Crespí
> > Departament de Serveis
> > Port d'Informació Científica (PIC)
> > Tel: +34 93 170 2729
> > ----------------------------------
> >
> >
>
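
For readers of the archive, the combined layout in the quoted diagram and
Gabor's remark about updating offsets can be illustrated with a small
sketch. It only plans where each row group would land in the combined stream
and by how much the offsets recorded for it would have to shift. It does not
read or write real Parquet metadata: the names RowGroupSpan, PlacedSpan and
plan_concatenation are hypothetical, the byte ranges are made up, and the
column indexes and bloom filters Gabor mentions are left out entirely.

```python
from dataclasses import dataclass
from typing import List

MAGIC = b"PAR1"  # 4-byte magic number at the start of every Parquet file


@dataclass
class RowGroupSpan:
    """Byte range of one row group in its source file (hypothetical type)."""
    source: str   # label of the source file, e.g. "A"
    offset: int   # where the row group starts in the source file
    length: int   # number of bytes the row group occupies


@dataclass
class PlacedSpan:
    """Where a row group ends up in the combined stream (hypothetical type)."""
    source: str
    old_offset: int
    new_offset: int
    length: int

    @property
    def shift(self) -> int:
        # Amount to add to every absolute offset recorded for this row group.
        return self.new_offset - self.old_offset


def plan_concatenation(spans: List[RowGroupSpan]) -> List[PlacedSpan]:
    """Lay the row groups out back to back after a single leading magic."""
    cursor = len(MAGIC)
    placed = []
    for span in spans:
        placed.append(PlacedSpan(span.source, span.offset, cursor, span.length))
        cursor += span.length
    return placed


# Made-up byte ranges for files A and B from the diagram (two row groups each).
spans = [
    RowGroupSpan("A", offset=4, length=100),
    RowGroupSpan("A", offset=104, length=50),
    RowGroupSpan("B", offset=4, length=70),
    RowGroupSpan("B", offset=74, length=30),
]
plan = plan_concatenation(spans)
for p in plan:
    print(p.source, p.old_offset, "->", p.new_offset, "shift", p.shift)
```

With these made-up sizes, the row groups of file A keep their offsets and
those of file B move 150 bytes further into the stream. A real new footer
would still need the merged metadata serialized in the Parquet footer
format, which this sketch does not attempt.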


-- 
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------
