I have worked in small data science/engineering teams where time for engineering is often a luxury and ad hoc data transformations and analyses are the norm. In such environments, a format that requires a catalog for efficient reads will be less effective than one that comes with batteries included and good defaults.
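Concretely, the reads that have to stay cheap here are the ones served entirely from the file footer. A minimal sketch with pyarrow (the file name is a hypothetical stand-in):

    import pyarrow.parquet as pq

    # "events.parquet" stands in for any standalone file in a store.
    f = pq.ParquetFile("events.parquet")

    # The footer alone carries the schema, row-group layout and per-column
    # statistics -- enough for pruning without any external catalog.
    meta = f.metadata
    print(meta.num_row_groups, meta.serialized_size)

    stats = meta.row_group(0).column(0).statistics
    if stats is not None and stats.has_min_max:
        print(stats.min, stats.max)  # min/max drive row-group skipping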
Aside: a nice view into ad hoc parquet workloads in the wild is the Kaggle forums [1]. (On Micah's point about better defaults, a rough sketch of the existing statistics knob follows below the quoted thread.)

[1] https://www.kaggle.com/search?q=parquet

Rok

On Wed, May 22, 2024 at 12:43 AM Micah Kornfield <[email protected]> wrote:

> From my perspective I think the answer is more or less both. Even with
> only the data lake use-case we see a wide variety of files that people
> would consider to be pushing reasonable boundaries. To some extent these
> might be solvable by libraries having better defaults (e.g. only
> collecting/writing statistics by default for the first N columns).
>
> On Tue, May 21, 2024 at 12:56 PM Steve Loughran <[email protected]> wrote:
>
> > I wish people would use Avro over CSV. Not just for the schema or more
> > complex structures, but because the parser recognises corrupt files. Oh,
> > and the well-defined serialization formats for things like "string" and
> > "number".
> >
> > That said, I generate CSV in test/utility code because it is trivial to
> > do and then feed straight into a spreadsheet -- I'm not trying to use it
> > for interchange.
> >
> > On Sat, 18 May 2024 at 17:10, Curt Hagenlocher <[email protected]> wrote:
> >
> > > While CSV is still the undisputed monarch of exchanging data via files,
> > > Parquet is arguably "top 3" -- and this is a scenario in which the file
> > > does really need to be self-contained.
> > >
> > > On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
> > > <[email protected]> wrote:
> > >
> > > > Hi Fokko,
> > > >
> > > > I am aware of catalogs such as Iceberg; my question was whether, in
> > > > the design of parquet, we can assume the existence of such a catalog.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael
> > > >
> > > > On 18 May 2024 16:18:22 BST, Fokko Driesprong <[email protected]> wrote:
> > > >
> > > > > Hey Raphael,
> > > > >
> > > > > Thanks for reaching out here. Have you looked into table formats
> > > > > such as Apache Iceberg <https://iceberg.apache.org/docs/nightly/>?
> > > > > This seems to fix the problem that you're describing.
> > > > >
> > > > > A table format adds an ACID layer to the file format and acts as a
> > > > > fully functional database. In the case of Iceberg, a catalog is
> > > > > required for atomicity, and alternatives like Delta Lake also seem
> > > > > to trend in that direction
> > > > > <https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
> > > > >
> > > > > > I'm conscious that for many users this responsibility is instead
> > > > > > delegated to a catalog that maintains its own index structures and
> > > > > > statistics, only relies on the parquet metadata for very late stage
> > > > > > pruning, and may therefore see limited benefit from revisiting the
> > > > > > parquet metadata structures.
> > > > >
> > > > > This is exactly what Iceberg offers; it provides additional metadata
> > > > > to speed up the planning process:
> > > > > https://iceberg.apache.org/docs/nightly/performance/
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > > On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > The recent discussions about metadata make me wonder where a
> > > > > > storage format ends and a database begins, as people seem to have
> > > > > > differing expectations of parquet here. In particular, one school
> > > > > > of thought posits that parquet should suffice as a standalone
> > > > > > technology, where users can write parquet files to a store and
> > > > > > efficiently query them directly with no additional technologies.
> > > > > > However, others instead view parquet as a storage format for use
> > > > > > in conjunction with some sort of catalog / metastore. These two
> > > > > > approaches naturally place very different demands on the parquet
> > > > > > format. The former case incentivizes constructing extremely large
> > > > > > parquet files, potentially on the order of TBs [1], such that the
> > > > > > parquet metadata alone can efficiently be used to service a query
> > > > > > without lots of random I/O to separate files. However, the latter
> > > > > > case incentivizes relatively small parquet files (< 1GB) laid out
> > > > > > in such a way that the catalog metadata can be used to efficiently
> > > > > > identify a much smaller set of files for a given query, and write
> > > > > > amplification can be avoided for inserts.
> > > > > >
> > > > > > Having only ever used parquet in the context of data lake style
> > > > > > systems, the catalog approach comes more naturally to me and plays
> > > > > > to parquet's current strengths; however, this does not seem to be
> > > > > > a universally held expectation. I've frequently found people
> > > > > > surprised when queries performed in the absence of a catalog are
> > > > > > slow, or who wish to efficiently mutate or append to parquet files
> > > > > > in place [2] [3] [4]. It is possibly anecdotal, but these
> > > > > > expectations seem to be more common where people are coming from
> > > > > > python-based tooling such as pandas, and might reflect weaker
> > > > > > tooling support for catalog systems in this ecosystem.
> > > > > >
> > > > > > Regardless, this mismatch appears to be at the core of at least
> > > > > > some of the discussions about metadata. I do not think it a
> > > > > > controversial take that the current metadata structures are simply
> > > > > > not set up for files on the order of >1TB, where the metadata
> > > > > > balloons to 10s or 100s of MB and takes 10s of milliseconds just
> > > > > > to parse. If this is in scope it would justify major changes to
> > > > > > the parquet metadata; however, I'm conscious that for many users
> > > > > > this responsibility is instead delegated to a catalog that
> > > > > > maintains its own index structures and statistics, only relies on
> > > > > > the parquet metadata for very late stage pruning, and may
> > > > > > therefore see limited benefit from revisiting the parquet metadata
> > > > > > structures.
> > > > > >
> > > > > > I'd be very interested to hear other people's thoughts on this.
> > > > > >
> > > > > > Kind Regards,
> > > > > >
> > > > > > Raphael
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow-rs/issues/5770
> > > > > > [2]: https://github.com/apache/datafusion/issues/9654
> > > > > > [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> > > > > > [4]: https://github.com/apache/arrow-rs/issues/557
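P.S. The statistics-defaults idea Micah raises above, sketched with pyarrow (column names and path are hypothetical; other writers expose similar options):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": [1, 2, 3],
        "ts": [10, 20, 30],
        "payload": ["a", "b", "c"],
    })

    # Collect/write statistics only for the leading, frequently-filtered
    # columns; very wide tables then stop paying footer overhead for
    # hundreds of columns that are never used for pruning.
    pq.write_table(
        table,
        "example.parquet",  # hypothetical path
        write_statistics=["id", "ts"],
    )

A default along the "first N columns" lines would just need writers to pick that list automatically.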
