Hi All,
The recent discussions about metadata make me wonder where a storage
format ends and a database begins, as people seem to have differing
expectations of parquet here. In particular, one school of thought
posits that parquet should suffice as a standalone technology, where
users can write parquet files to a store and query them efficiently
with no additional infrastructure. Others instead view parquet as a
storage format for use in conjunction with some sort of catalog /
metastore. These two approaches naturally place very different
demands on the parquet format. The former incentivizes constructing
extremely large parquet files, potentially on the order of TBs [1],
such that the parquet metadata alone can be used to service a query
efficiently, without lots of random I/O to separate files. The
latter, by contrast, incentivizes relatively small parquet files
(< 1GB) laid out in such a way that the catalog metadata can be used
to efficiently identify a much smaller set of files for a given
query, and write amplification can be avoided for inserts.
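For concreteness, a minimal sketch of the catalog-free case using
pyarrow (the bucket path and column name are hypothetical): with no
catalog, the engine must fetch and parse the footer of every file
under the prefix before it can prune anything.

    import pyarrow.dataset as ds

    # Hypothetical layout: a directory of parquet files and nothing
    # else. Pruning relies entirely on the statistics in each file's
    # footer, so every footer must be read before any data is skipped.
    dataset = ds.dataset("s3://my-bucket/events/", format="parquet")
    table = dataset.to_table(filter=ds.field("user_id") == 42)

This is part of what makes a single huge file with rich metadata
attractive in that model: one footer fetch instead of thousands.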
Having only ever used parquet in the context of data lake style
systems, I find the catalog approach more natural, and it plays to
parquet's current strengths; however, this does not seem to be a
universally held expectation. I've frequently encountered people who
are surprised when queries performed in the absence of a catalog are
slow, or who wish to efficiently mutate or append to parquet files in
place [2] [3] [4]. This is possibly anecdotal, but these expectations
seem to be more common among people coming from Python-based tooling
such as pandas, and might reflect weaker tooling support for catalog
systems in that ecosystem.
Regardless, this mismatch appears to be at the core of at least some
of the discussions about metadata. I do not think it a controversial
take that the current metadata structures are simply not set up for
files on the order of >1TB, where the metadata balloons to tens or
hundreds of MB and takes tens of milliseconds just to parse. If this
is in scope, it would justify major changes to the parquet metadata;
however, I'm conscious that for many users this responsibility is
instead delegated to a catalog that maintains its own index
structures and statistics, relies on the parquet metadata only for
very late-stage pruning, and may therefore see limited benefit from
revisiting the parquet metadata structures.
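To put rough numbers on this yourself, a quick sketch with pyarrow
(the file name is hypothetical; read_metadata and serialized_size are
existing pyarrow APIs):

    import time
    import pyarrow.parquet as pq

    path = "large.parquet"  # hypothetical very large file

    start = time.perf_counter()
    meta = pq.read_metadata(path)  # reads and parses the Thrift footer
    elapsed = time.perf_counter() - start

    print(f"footer size: {meta.serialized_size / 1e6:.1f} MB")
    print(f"row groups:  {meta.num_row_groups}")
    print(f"columns:     {meta.num_columns}")
    print(f"parse time:  {elapsed * 1e3:.1f} ms")

The footer grows roughly with num_row_groups * num_columns, which is
why very large, wide files end up at the sizes and parse times above.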
I'd be very interested to hear other people's thoughts on this.
Kind Regards,
Raphael
[1]: https://github.com/apache/arrow-rs/issues/5770
[2]: https://github.com/apache/datafusion/issues/9654
[3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
[4]: https://github.com/apache/arrow-rs/issues/557