While CSV is still the undisputed monarch of exchanging data via files, Parquet is arguably in the top three -- and that is a scenario in which the file really does need to be self-contained.
On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies <[email protected]> wrote:

> Hi Fokko,
>
> I am aware of catalogs such as Iceberg; my question was whether, in the
> design of parquet, we can assume the existence of such a catalog.
>
> Kind Regards,
>
> Raphael
>
> On 18 May 2024 16:18:22 BST, Fokko Driesprong <[email protected]> wrote:
>
> > Hey Raphael,
> >
> > Thanks for reaching out here. Have you looked into table formats such as
> > Apache Iceberg <https://iceberg.apache.org/docs/nightly/>? This seems to
> > fix the problem that you're describing.
> >
> > A table format adds an ACID layer on top of the file format and acts as
> > a fully functional database. In the case of Iceberg, a catalog is
> > required for atomicity, and alternatives like Delta Lake also seem to be
> > trending in that direction
> > <https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
> >
> > > I'm conscious that for many users this responsibility is instead
> > > delegated to a catalog that maintains its own index structures and
> > > statistics, only relies on the parquet metadata for very late stage
> > > pruning, and may therefore see limited benefit from revisiting the
> > > parquet metadata structures.
> >
> > This is exactly what Iceberg offers: it provides additional metadata to
> > speed up the planning process:
> > https://iceberg.apache.org/docs/nightly/performance/
> >
> > Kind regards,
> > Fokko
> >
> > On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
> > <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > The recent discussions about metadata make me wonder where a storage
> > > format ends and a database begins, as people seem to have differing
> > > expectations of parquet here. In particular, one school of thought
> > > posits that parquet should suffice as a standalone technology, where
> > > users can write parquet files to a store and efficiently query them
> > > directly with no additional technologies. However, others instead view
> > > parquet as a storage format for use in conjunction with some sort of
> > > catalog / metastore. These two approaches naturally place very
> > > different demands on the parquet format. The former incentivizes
> > > constructing extremely large parquet files, potentially on the order
> > > of TBs [1], such that the parquet metadata alone can efficiently be
> > > used to service a query without lots of random I/O to separate files.
> > > The latter, however, incentivizes relatively small parquet files
> > > (< 1GB) laid out in such a way that the catalog metadata can be used
> > > to efficiently identify a much smaller set of files for a given query,
> > > and write amplification can be avoided for inserts.
> > >
> > > Having only ever used parquet in the context of data lake style
> > > systems, the catalog approach comes more naturally to me and plays to
> > > parquet's current strengths. However, this does not seem to be a
> > > universally held expectation. I've frequently found people surprised
> > > when queries performed in the absence of a catalog are slow, or who
> > > wish to efficiently mutate or append to parquet files in place [2]
> > > [3] [4]. It is possibly anecdotal, but these expectations seem to be
> > > more common where people are coming from python-based tooling such as
> > > pandas, and might reflect weaker tooling support for catalog systems
> > > in this ecosystem.
> > >
> > > Regardless, this mismatch appears to be at the core of at least some
> > > of the discussions about metadata. I do not think it a controversial
> > > take that the current metadata structures are simply not set up for
> > > files on the order of >1TB, where the metadata balloons to 10s or
> > > 100s of MB and takes 10s of milliseconds just to parse. If this is in
> > > scope it would justify major changes to the parquet metadata.
> > > However, I'm conscious that for many users this responsibility is
> > > instead delegated to a catalog that maintains its own index
> > > structures and statistics, only relies on the parquet metadata for
> > > very late stage pruning, and may therefore see limited benefit from
> > > revisiting the parquet metadata structures.
> > >
> > > I'd be very interested to hear other people's thoughts on this.
> > >
> > > Kind Regards,
> > >
> > > Raphael
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/5770
> > > [2]: https://github.com/apache/datafusion/issues/9654
> > > [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> > > [4]: https://github.com/apache/arrow-rs/issues/557
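For context on the footer sizes mentioned above, here is a back-of-envelope sketch in Python. The Parquet footer stores one ColumnChunk metadata entry per (row group, column) pair, so footer size grows with both file size and width. The constants below (bytes per serialized entry, row group size, column count) are illustrative assumptions on my part, not measurements, but they show roughly how a TB-scale file ends up with tens of MB of footer:

```python
# Back-of-envelope estimate of Parquet footer growth.
# All constants are illustrative assumptions, not measurements.

BYTES_PER_COLUMN_CHUNK = 100      # assumed serialized Thrift size per ColumnChunk entry
ROW_GROUP_BYTES = 128 * 1024**2   # assumed 128 MiB row groups

def footer_estimate(file_bytes: int, num_columns: int) -> int:
    """Estimate footer size: one metadata entry per (row group, column) pair."""
    row_groups = file_bytes // ROW_GROUP_BYTES
    return row_groups * num_columns * BYTES_PER_COLUMN_CHUNK

# A 1 GiB file with 100 columns: 8 row groups -> footer stays small.
print(footer_estimate(1 * 1024**3, 100))   # 80000 bytes (~80 KB)

# A 1 TiB file with 100 columns: 8192 row groups -> footer hits tens of MB.
print(footer_estimate(1024**4, 100))       # 81920000 bytes (~80 MB)
```

Under these assumptions the footer scales linearly in both row-group count and column count, which is why very large and very wide files are the cases where parsing the metadata alone becomes a measurable cost.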
