Hi Fokko,

I am aware of catalogs such as Iceberg; my question was whether, in the design of Parquet, we can assume the existence of such a catalog.
Kind Regards,

Raphael

On 18 May 2024 16:18:22 BST, Fokko Driesprong <fo...@apache.org> wrote:
>Hey Raphael,
>
>Thanks for reaching out here. Have you looked into table formats such as
>Apache Iceberg <https://iceberg.apache.org/docs/nightly/>? These seem to
>address the problem that you're describing.
>
>A table format adds an ACID layer to the file format and acts as a fully
>functional database. In the case of Iceberg, a catalog is required for
>atomicity, and alternatives like Delta Lake also seem to trend in that
>direction
><https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
>
>> I'm conscious that for many users this responsibility is instead
>> delegated to a catalog that maintains its own index structures and
>> statistics, only relies on the parquet metadata for very late stage
>> pruning, and may therefore see limited benefit from revisiting the
>> parquet metadata structures.
>
>This is exactly what Iceberg offers: it provides additional metadata to
>speed up the planning process:
>https://iceberg.apache.org/docs/nightly/performance/
>
>Kind regards,
>Fokko
>
>On Sat 18 May 2024 at 16:40, Raphael Taylor-Davies
><r.taylordav...@googlemail.com.invalid> wrote:
>
>> Hi All,
>>
>> The recent discussions about metadata make me wonder where a storage
>> format ends and a database begins, as people seem to have differing
>> expectations of Parquet here. In particular, one school of thought
>> posits that Parquet should suffice as a standalone technology, where
>> users can write Parquet files to a store and efficiently query them
>> directly with no additional technologies. However, others instead view
>> Parquet as a storage format for use in conjunction with some sort of
>> catalog / metastore. These two approaches naturally place very
>> different demands on the Parquet format.
>> The former case incentivizes constructing extremely large Parquet
>> files, potentially on the order of TBs [1], such that the Parquet
>> metadata alone can efficiently be used to service a query without
>> lots of random I/O to separate files. However, the latter case
>> incentivizes relatively small Parquet files (< 1GB) laid out in such
>> a way that the catalog metadata can be used to efficiently identify
>> a much smaller set of files for a given query, and write
>> amplification can be avoided for inserts.
>>
>> Having only ever used Parquet in the context of data lake style
>> systems, the catalog approach comes more naturally to me and plays to
>> Parquet's current strengths; however, this does not seem to be a
>> universally held expectation. I've frequently found people surprised
>> when queries performed in the absence of a catalog are slow, or who
>> wish to efficiently mutate or append to Parquet files in place [2]
>> [3] [4]. It is possibly anecdotal, but these expectations seem to be
>> more common where people are coming from Python-based tooling such as
>> pandas, and might reflect weaker tooling support for catalog systems
>> in that ecosystem.
>>
>> Regardless, this mismatch appears to be at the core of at least some
>> of the discussions about metadata. I do not think it a controversial
>> take that the current metadata structures are simply not set up for
>> files on the order of >1TB, where the metadata balloons to 10s or
>> 100s of MB and takes 10s of milliseconds just to parse. If this is in
>> scope it would justify major changes to the Parquet metadata;
>> however, I'm conscious that for many users this responsibility is
>> instead delegated to a catalog that maintains its own index
>> structures and statistics, only relies on the Parquet metadata for
>> very late stage pruning, and may therefore see limited benefit from
>> revisiting the Parquet metadata structures.
>>
>> I'd be very interested to hear other people's thoughts on this.
>>
>> Kind Regards,
>>
>> Raphael
>>
>> [1]: https://github.com/apache/arrow-rs/issues/5770
>> [2]: https://github.com/apache/datafusion/issues/9654
>> [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
>> [4]: https://github.com/apache/arrow-rs/issues/557