Hello all,

I work in environments where both usages exist. The single file approach, at least in this setting, comes from the fact that a lot of input data for ML pipelines has historically been a single CSV file dump. Since a lot of data analysis tools have also been single-threaded, people are just way too used to the single file approach. A lot of them simply don't know about the existence and benefits of table formats on top of Parquet.
In half of all cases, the single file approach actually seems sufficient to me, but once querying, multi-threading or larger data sets are involved, a table format would be much better. These people are not so deep in the engineering "world" and thus continue to assume that the single file approach scales. I think we should advertise the table formats a bit more in our documentation, since we're already working on it anyway, to avoid such questions coming up.

I personally think that there is an upper limit per file depending on the use case. Once you go beyond that, or have situations like updating your dataset while using it at the same time, you should definitely have a table format on top.

Best
Uwe

On Sun, May 19, 2024, at 11:56 AM, Andrew Lamb wrote:
>> and this is a scenario in which the file does really need to be
>> self-contained.
>
> What do you mean by "self-contained"?
>
> If the use case is exchanging data via files, perhaps only the (relatively
> small) metadata about types / how to read the file (rather than potentially
> large min/max statistics) is required?
>
> If the use case is replacing loading CSV files into a database or other
> system (and building indexes, etc. during the load) so querying it is
> faster, then the additional metadata seems warranted.
>
> I think the beauty of parquet is that it is an efficient data exchange format
> and comes with features that make queries reasonably fast without requiring a
> second system (e.g. a database) to manage. However, if you want to have even
> faster performance you can build a second system on top of parquet (with
> catalogs / indexes, etc.).
>
> BTW, with systems like DataFusion, for example, it is relatively
> straightforward to build an index that prunes parquet files based on
> predicates and information stored in parquet metadata without even opening
> the files at query time. See this example [1].
>
> Andrew
>
> [1] https://github.com/apache/datafusion/pull/10549
>
>
> On Sat, May 18, 2024 at 12:10 PM Curt Hagenlocher <[email protected]>
> wrote:
>
>> While CSV is still the undisputed monarch of exchanging data via files,
>> Parquet is arguably "top 3" -- and this is a scenario in which the file
>> does really need to be self-contained.
>>
>> On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
>> <[email protected]> wrote:
>>
>> > Hi Fokko,
>> >
>> > I am aware of catalogs such as Iceberg; my question was whether, in the
>> > design of parquet, we can assume the existence of such a catalog.
>> >
>> > Kind Regards,
>> >
>> > Raphael
>> >
>> > On 18 May 2024 16:18:22 BST, Fokko Driesprong <[email protected]> wrote:
>> > > Hey Raphael,
>> > >
>> > > Thanks for reaching out here. Have you looked into table formats such as
>> > > Apache Iceberg <https://iceberg.apache.org/docs/nightly/>? This seems to
>> > > fix the problem that you're describing.
>> > >
>> > > A table format adds an ACID layer to the file format and acts as a fully
>> > > functional database. In the case of Iceberg, a catalog is required for
>> > > atomicity, and alternatives like Delta Lake also seem to trend in that
>> > > direction
>> > > <https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
>> > >
>> > >> I'm conscious that for many users this responsibility is instead
>> > >> delegated to a catalog that maintains its own index structures and
>> > >> statistics, only relies on the parquet metadata for very late stage
>> > >> pruning, and may therefore see limited benefit from revisiting the
>> > >> parquet metadata structures.
>> > >
>> > > This is exactly what Iceberg offers; it provides additional metadata to
>> > > speed up the planning process:
>> > > https://iceberg.apache.org/docs/nightly/performance/
>> > >
>> > > Kind regards,
>> > > Fokko
>> > >
>> > > On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
>> > > <[email protected]> wrote:
>> > >
>> > >> Hi All,
>> > >>
>> > >> The recent discussions about metadata make me wonder where a storage
>> > >> format ends and a database begins, as people seem to have differing
>> > >> expectations of parquet here. In particular, one school of thought
>> > >> posits that parquet should suffice as a standalone technology, where
>> > >> users can write parquet files to a store and efficiently query them
>> > >> directly with no additional technologies. However, others instead view
>> > >> parquet as a storage format for use in conjunction with some sort of
>> > >> catalog / metastore. These two approaches naturally place very different
>> > >> demands on the parquet format. The former case incentivizes constructing
>> > >> extremely large parquet files, potentially on the order of TBs [1], such
>> > >> that the parquet metadata alone can efficiently be used to service a
>> > >> query without lots of random I/O to separate files. However, the latter
>> > >> case incentivizes relatively small parquet files (< 1GB) laid out in
>> > >> such a way that the catalog metadata can be used to efficiently identify
>> > >> a much smaller set of files for a given query, and write amplification
>> > >> can be avoided for inserts.
>> > >>
>> > >> Having only ever used parquet in the context of data lake style systems,
>> > >> the catalog approach comes more naturally to me and plays to parquet's
>> > >> current strengths; however, this does not seem to be a universally held
>> > >> expectation. I've frequently found people surprised when queries
>> > >> performed in the absence of a catalog are slow, or who wish to
>> > >> efficiently mutate or append to parquet files in place [2] [3] [4]. It
>> > >> is possibly anecdotal, but these expectations seem to be more common
>> > >> where people are coming from Python-based tooling such as pandas, and
>> > >> might reflect weaker tooling support for catalog systems in this
>> > >> ecosystem.
>> > >>
>> > >> Regardless, this mismatch appears to be at the core of at least some of
>> > >> the discussions about metadata. I do not think it a controversial take
>> > >> that the current metadata structures are simply not set up for files on
>> > >> the order of >1TB, where the metadata balloons to 10s or 100s of MB and
>> > >> takes 10s of milliseconds just to parse. If this is in scope it would
>> > >> justify major changes to the parquet metadata; however, I'm conscious
>> > >> that for many users this responsibility is instead delegated to a
>> > >> catalog that maintains its own index structures and statistics, only
>> > >> relies on the parquet metadata for very late stage pruning, and may
>> > >> therefore see limited benefit from revisiting the parquet metadata
>> > >> structures.
>> > >>
>> > >> I'd be very interested to hear other people's thoughts on this.
>> > >>
>> > >> Kind Regards,
>> > >>
>> > >> Raphael
>> > >>
>> > >> [1]: https://github.com/apache/arrow-rs/issues/5770
>> > >> [2]: https://github.com/apache/datafusion/issues/9654
>> > >> [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
>> > >> [4]: https://github.com/apache/arrow-rs/issues/557
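To make the pruning idea from Andrew's message above concrete, here is a minimal sketch, assuming pyarrow and entirely made-up file names, column name and predicate value, of deciding from the footer metadata alone (min/max statistics) whether a file can be skipped for an equality predicate, without opening any data pages. It only illustrates the technique; the DataFusion example linked as [1] is a separate, Rust-based implementation.

import pyarrow.parquet as pq

def may_contain(path: str, column: str, value) -> bool:
    """Return False only when footer statistics prove `value` is absent."""
    md = pq.read_metadata(path)  # reads just the footer, not the data pages
    found = False
    for rg_idx in range(md.num_row_groups):
        rg = md.row_group(rg_idx)
        for col_idx in range(rg.num_columns):
            col = rg.column(col_idx)
            if col.path_in_schema != column:
                continue
            found = True
            stats = col.statistics
            if stats is None or not stats.has_min_max:
                return True   # no statistics: cannot prune this file
            if stats.min <= value <= stats.max:
                return True   # this row group may contain the value
    # All matching row groups exclude the value; if the column was never
    # found, keep the file conservatively instead of pruning it.
    return not found

# Hypothetical file list and predicate, purely for illustration.
files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]
candidates = [f for f in files if may_contain(f, "user_id", 42)]
print(candidates)

A catalog or index would typically cache these statistics once instead of re-reading every footer per query, which is exactly the "second system on top of parquet" Andrew describes.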

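Raphael's point about footer metadata size and parse cost is also easy to check for a given file. A small sketch, again assuming pyarrow and a hypothetical file path, reports how large the Thrift-encoded footer is and roughly how long it takes to read and decode; the 10s-100s of MB and 10s of milliseconds he mentions refer to files on the order of a terabyte with many columns and row groups.

import time
import pyarrow.parquet as pq

path = "very_large_file.parquet"  # hypothetical path

start = time.perf_counter()
md = pq.read_metadata(path)       # reads and Thrift-decodes the footer
elapsed_ms = (time.perf_counter() - start) * 1000  # includes footer I/O

print(f"row groups:  {md.num_row_groups}")
print(f"columns:     {md.num_columns}")
print(f"footer size: {md.serialized_size / 1e6:.2f} MB")
print(f"parse time:  {elapsed_ms:.1f} ms")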