Hi All,
The recent discussions about metadata make me wonder where a storage
format ends and a database begins, as people seem to have differing
expectations of parquet here. In particular, one school of thought
posits that parquet should suffice as a standalone technology, where
users can write parquet files to a store and query them efficiently
with no additional infrastructure. Others instead view parquet as a
storage format for use in conjunction with some sort of catalog /
metastore. These two approaches naturally place very different
demands on the parquet format. The former incentivizes constructing
extremely large parquet files, potentially on the order of TBs [1],
such that the parquet metadata alone can be used to service a query
efficiently, without lots of random I/O to separate files. The
latter, by contrast, incentivizes relatively small parquet files
(< 1GB) laid out in such a way that the catalog metadata can be used
to efficiently identify a much smaller set of files for a given
query, and write amplification can be avoided for inserts.
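For concreteness, a minimal sketch of the catalog-free case using
pyarrow (the bucket path and column name are hypothetical): with no
catalog, the engine must fetch and parse the footer of every file
under the prefix before it can prune anything.

    import pyarrow.dataset as ds

    # Hypothetical layout: a directory of parquet files and nothing
    # else. Pruning relies entirely on the statistics in each file's
    # footer, so every footer must be read before any data is skipped.
    dataset = ds.dataset("s3://my-bucket/events/", format="parquet")
    table = dataset.to_table(filter=ds.field("user_id") == 42)

This is part of what makes a single huge file with rich metadata
attractive in that model: one footer fetch instead of thousands.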
Having only ever used parquet in the context of data lake style
systems, I find the catalog approach more natural, and it plays to
parquet's current strengths; however, this does not seem to be a
universally held expectation. I've frequently encountered people who
are surprised when queries performed in the absence of a catalog are
slow, or who wish to efficiently mutate or append to parquet files in
place [2] [3] [4]. This is possibly anecdotal, but these expectations
seem to be more common among people coming from Python-based tooling
such as pandas, and might reflect weaker tooling support for catalog
systems in that ecosystem.
Regardless, this mismatch appears to be at the core of at least some
of the discussions about metadata. I do not think it a controversial
take that the current metadata structures are simply not set up for
files on the order of >1TB, where the metadata balloons to tens or
hundreds of MB and takes tens of milliseconds just to parse. If this
is in scope, it would justify major changes to the parquet metadata;
however, I'm conscious that for many users this responsibility is
instead delegated to a catalog that maintains its own index
structures and statistics, relies on the parquet metadata only for
very late-stage pruning, and may therefore see limited benefit from
revisiting the parquet metadata structures.
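To put rough numbers on this yourself, a quick sketch with pyarrow
(the file name is hypothetical; read_metadata and serialized_size are
existing pyarrow APIs):

    import time
    import pyarrow.parquet as pq

    path = "large.parquet"  # hypothetical very large file

    start = time.perf_counter()
    meta = pq.read_metadata(path)  # reads and parses the Thrift footer
    elapsed = time.perf_counter() - start

    print(f"footer size: {meta.serialized_size / 1e6:.1f} MB")
    print(f"row groups:  {meta.num_row_groups}")
    print(f"columns:     {meta.num_columns}")
    print(f"parse time:  {elapsed * 1e3:.1f} ms")

The footer grows roughly with num_row_groups * num_columns, which is
why very large, wide files end up at the sizes and parse times above.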
I'd be very interested to hear other people's thoughts on this.
Kind Regards,
Raphael
[1]: https://github.com/apache/arrow-rs/issues/5770
[2]: https://github.com/apache/datafusion/issues/9654
[3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
[4]: https://github.com/apache/arrow-rs/issues/557