While CSV is still the undisputed monarch of exchanging data via files, Parquet is arguably in the top three -- and that is a scenario in which the file really does need to be self-contained.
On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies <[email protected]> wrote:

> Hi Fokko,
>
> I am aware of catalogs such as Iceberg; my question was whether, in the
> design of parquet, we can assume the existence of such a catalog.
>
> Kind Regards,
>
> Raphael
>
> On 18 May 2024 16:18:22 BST, Fokko Driesprong <[email protected]> wrote:
>
> > Hey Raphael,
> >
> > Thanks for reaching out here. Have you looked into table formats such as
> > Apache Iceberg <https://iceberg.apache.org/docs/nightly/>? This seems to
> > fix the problem that you're describing.
> >
> > A table format adds an ACID layer on top of the file format and acts as
> > a fully functional database. In the case of Iceberg, a catalog is
> > required for atomicity, and alternatives like Delta Lake also seem to be
> > trending in that direction
> > <https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
> >
> > > I'm conscious that for many users this responsibility is instead
> > > delegated to a catalog that maintains its own index structures and
> > > statistics, only relies on the parquet metadata for very late stage
> > > pruning, and may therefore see limited benefit from revisiting the
> > > parquet metadata structures.
> >
> > This is exactly what Iceberg offers: it provides additional metadata to
> > speed up the planning process:
> > https://iceberg.apache.org/docs/nightly/performance/
> >
> > Kind regards,
> > Fokko
> >
> > On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
> > <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > The recent discussions about metadata make me wonder where a storage
> > > format ends and a database begins, as people seem to have differing
> > > expectations of parquet here. In particular, one school of thought
> > > posits that parquet should suffice as a standalone technology, where
> > > users can write parquet files to a store and efficiently query them
> > > directly with no additional technologies. However, others instead view
> > > parquet as a storage format for use in conjunction with some sort of
> > > catalog / metastore. These two approaches naturally place very
> > > different demands on the parquet format. The former incentivizes
> > > constructing extremely large parquet files, potentially on the order
> > > of TBs [1], such that the parquet metadata alone can efficiently be
> > > used to service a query without lots of random I/O to separate files.
> > > The latter, however, incentivizes relatively small parquet files
> > > (< 1GB) laid out in such a way that the catalog metadata can be used
> > > to efficiently identify a much smaller set of files for a given query,
> > > and write amplification can be avoided for inserts.
> > >
> > > Having only ever used parquet in the context of data lake style
> > > systems, the catalog approach comes more naturally to me and plays to
> > > parquet's current strengths. However, this does not seem to be a
> > > universally held expectation. I've frequently found people surprised
> > > when queries performed in the absence of a catalog are slow, or who
> > > wish to efficiently mutate or append to parquet files in place [2]
> > > [3] [4]. It is possibly anecdotal, but these expectations seem to be
> > > more common where people are coming from python-based tooling such as
> > > pandas, and might reflect weaker tooling support for catalog systems
> > > in this ecosystem.
> > >
> > > Regardless, this mismatch appears to be at the core of at least some
> > > of the discussions about metadata. I do not think it a controversial
> > > take that the current metadata structures are simply not set up for
> > > files on the order of >1TB, where the metadata balloons to 10s or
> > > 100s of MB and takes 10s of milliseconds just to parse. If this is in
> > > scope it would justify major changes to the parquet metadata.
> > > However, I'm conscious that for many users this responsibility is
> > > instead delegated to a catalog that maintains its own index
> > > structures and statistics, only relies on the parquet metadata for
> > > very late stage pruning, and may therefore see limited benefit from
> > > revisiting the parquet metadata structures.
> > >
> > > I'd be very interested to hear other people's thoughts on this.
> > >
> > > Kind Regards,
> > >
> > > Raphael
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/5770
> > > [2]: https://github.com/apache/datafusion/issues/9654
> > > [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> > > [4]: https://github.com/apache/arrow-rs/issues/557
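For context on the footer sizes mentioned above, here is a back-of-envelope sketch in Python. The Parquet footer stores one ColumnChunk metadata entry per (row group, column) pair, so footer size grows with both file size and width. The constants below (bytes per serialized entry, row group size, column count) are illustrative assumptions on my part, not measurements, but they show roughly how a TB-scale file ends up with tens of MB of footer:

```python
# Back-of-envelope estimate of Parquet footer growth.
# All constants are illustrative assumptions, not measurements.

BYTES_PER_COLUMN_CHUNK = 100      # assumed serialized Thrift size per ColumnChunk entry
ROW_GROUP_BYTES = 128 * 1024**2   # assumed 128 MiB row groups

def footer_estimate(file_bytes: int, num_columns: int) -> int:
    """Estimate footer size: one metadata entry per (row group, column) pair."""
    row_groups = file_bytes // ROW_GROUP_BYTES
    return row_groups * num_columns * BYTES_PER_COLUMN_CHUNK

# A 1 GiB file with 100 columns: 8 row groups -> footer stays small.
print(footer_estimate(1 * 1024**3, 100))   # 80000 bytes (~80 KB)

# A 1 TiB file with 100 columns: 8192 row groups -> footer hits tens of MB.
print(footer_estimate(1024**4, 100))       # 81920000 bytes (~80 MB)
```

Under these assumptions the footer scales linearly in both row-group count and column count, which is why very large and very wide files are the cases where parsing the metadata alone becomes a measurable cost.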
