Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-24 Thread Steve Loughran
One key aspect is that data lake query engines often see data for the first time when queried; I remember some Dremel/Bigtable paper talking about it. Stats in the summary mean that they can immediately do some effective query planning, without any import phase. Without that they need to collect

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-23 Thread Julien Le Dem
I would agree it's a bit of both. The metadata overhead (per data volume) doesn't increase when you have fewer files. That being said, you could use fewer of the metadata features in that use case if the goal is to exchange well formed data without ambiguity. For wide schema it would be useful to

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-22 Thread Rok Mihevc
I have worked in small data science/engineering teams where time to do engineering is often a luxury and ad hoc data transformations and analysis are the norm. In such environments a format that requires a catalog for efficient reads will be less effective than one that comes with batteries and

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-21 Thread Micah Kornfield
From my perspective I think the answer is more or less both. Even with only the data lake use-case we see a wide variety of files on what people would be considered to be pushing reasonable boundaries. To some extent these might be solvable by having libraries have better defaults (e.g. only

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-21 Thread Steve Loughran
I wish people would use Avro over CSV. Not just for the schema or more complex structures, but because the parser recognises corrupt files. Oh, and the well defined serialization formats for things like "string" and "number". That said, I generate CSV in test/utility code because it is trivial do

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-20 Thread Uwe L. Korn
Hello all, I work in environments where both usages exist. The single file approach at least in this setting comes from the fact that a lot of input data for ML pipelines has been historically a single CSV file dump. As also a lot of data analysis tools have been single-threaded, people are

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-19 Thread Andrew Lamb
> and this is a scenario in which the file does really need to be self-contained. What do you mean by "self-contained"? If the use case is exchanging data via files, perhaps only the (relatively small) metadata about types / how to read the file (rather than potentially large min/max statistics)

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-18 Thread Curt Hagenlocher
While CSV is still the undisputed monarch of exchanging data via files, Parquet is arguably "top 3" -- and this is a scenario in which the file does really need to be self-contained. On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies wrote: > Hi Fokko, > > I am aware of catalogs such as

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-18 Thread Raphael Taylor-Davies
Hi Fokko, I am aware of catalogs such as iceberg, my question was if in the design of parquet we can assume the existence of such a catalog. Kind Regards, Raphael On 18 May 2024 16:18:22 BST, Fokko Driesprong wrote: >Hey Raphael, > >Thanks for reaching out here. Have you looked into table

Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-18 Thread Fokko Driesprong
Hey Raphael, Thanks for reaching out here. Have you looked into table formats such as Apache Iceberg? This seems to fix the problem that you're describing. A table format adds an ACID layer to the file format and acts as a fully functional database. In

Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-18 Thread Raphael Taylor-Davies
Hi All, The recent discussions about metadata make me wonder where a storage format ends and a database begins, as people seem to have differing expectations of parquet here. In particular, one school of thought posits that parquet should suffice as a standalone technology, where users can