Hi All,

The recent discussions about metadata make me wonder where a storage format ends and a database begins, as people seem to have differing expectations of parquet here. In particular, one school of thought posits that parquet should suffice as a standalone technology, where users can write parquet files to a store and efficiently query them directly with no additional technologies. Others instead view parquet as a storage format for use in conjunction with some sort of catalog / metastore. These two approaches naturally place very different demands on the parquet format. The former incentivizes constructing extremely large parquet files, potentially on the order of TBs [1], such that the parquet metadata alone can be used to service a query efficiently, without lots of random I/O to separate files. The latter, by contrast, incentivizes relatively small parquet files (< 1GB) laid out in such a way that the catalog metadata can efficiently identify a much smaller set of files for a given query, and write amplification can be avoided for inserts.
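
To make the first case concrete, the sort of footer-only pruning I have in mind looks roughly like the sketch below, using pyarrow purely for illustration (the file and column names are hypothetical):

    import pyarrow.parquet as pq

    # Hypothetical single large file and predicate column
    pf = pq.ParquetFile("events.parquet")
    meta = pf.metadata
    col_idx = meta.schema.names.index("id")

    # Keep only the row groups whose min/max statistics could contain the
    # predicate value, so data pages are fetched for a small subset of the file
    keep = []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max or stats.min <= 12345 <= stats.max:
            keep.append(rg)

    table = pf.read_row_groups(keep, columns=["id", "value"])

With a file on the order of TBs, all of this pruning has to come out of the footer itself, which is what puts pressure on the size and parse cost of that metadata.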

Having only ever used parquet in the context of data lake style systems, I find the catalog approach more natural, and it plays to parquet's current strengths; however, this does not seem to be a universally held expectation. I've frequently encountered people who are surprised when queries performed in the absence of a catalog are slow, or who wish to efficiently mutate or append to parquet files in place [2] [3] [4]. This is possibly anecdotal, but these expectations seem to be more common among people coming from python-based tooling such as pandas, and might reflect weaker tooling support for catalog systems in that ecosystem.

Regardless, this mismatch appears to be at the core of at least some of the discussions about metadata. I do not think it a controversial take that the current metadata structures are simply not set up for files on the order of 1TB or more, where the metadata balloons to 10s or 100s of MB and takes 10s of milliseconds just to parse. If this is in scope, it would justify major changes to the parquet metadata; however, I'm conscious that for many users this responsibility is instead delegated to a catalog that maintains its own index structures and statistics, relies on the parquet metadata only for very late stage pruning, and may therefore see limited benefit from revisiting the parquet metadata structures.
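
For a rough sense of scale, one way to measure this for a given file is something like the following pyarrow sketch (file name hypothetical); serialized_size reports the size of the thrift-encoded footer:

    import time
    import pyarrow.parquet as pq

    path = "very_large.parquet"  # hypothetical file with many row groups / columns

    start = time.perf_counter()
    meta = pq.read_metadata(path)  # reads and thrift-decodes the footer only
    elapsed_ms = (time.perf_counter() - start) * 1e3

    print(f"row groups: {meta.num_row_groups}, columns: {meta.num_columns}")
    print(f"thrift-encoded footer: {meta.serialized_size / 1e6:.1f} MB")
    print(f"footer read + decode: {elapsed_ms:.1f} ms")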

I'd be very interested to hear other people's thoughts on this.

Kind Regards,

Raphael

[1]: https://github.com/apache/arrow-rs/issues/5770
[2]: https://github.com/apache/datafusion/issues/9654
[3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
[4]: https://github.com/apache/arrow-rs/issues/557
