I wish people would use Avro over CSV. Not just for the schema or more
complex structures, but because the parser recognises corrupt files. Oh,
and the well-defined serialization formats for things like "string" and
"number".

That said, I generate CSV in test/utility code because it is trivial to do
and I can then feed it straight into a spreadsheet - I'm not trying to use
it for interchange.
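
For that kind of thing, the stdlib csv module is all it takes (a small
sketch; the rows are placeholder test output):

    import csv

    # Placeholder results from a test/benchmark run.
    rows = [("case", "elapsed_ms"), ("parse_small", 12), ("parse_large", 340)]

    with open("results.csv", "w", newline="") as out:
        csv.writer(out).writerows(rows)  # open directly in a spreadsheet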

On Sat, 18 May 2024 at 17:10, Curt Hagenlocher <c...@hagenlocher.org> wrote:

> While CSV is still the undisputed monarch of exchanging data via files,
> Parquet is arguably "top 3" -- and this is a scenario in which the file
> does really need to be self-contained.
>
> On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > Hi Fokko,
> >
> > I am aware of catalogs such as Iceberg; my question was whether, in the
> > design of parquet, we can assume the existence of such a catalog.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > On 18 May 2024 16:18:22 BST, Fokko Driesprong <fo...@apache.org> wrote:
> > >Hey Raphael,
> > >
> > >Thanks for reaching out here. Have you looked into table formats such
> > >as Apache Iceberg <https://iceberg.apache.org/docs/nightly/>? This
> > >seems to fix the problem that you're describing.
> > >
> > >A table format adds an ACID layer to the file format and acts as a
> > >fully functional database. In the case of Iceberg, a catalog is
> > >required for atomicity, and alternatives like Delta Lake also seem to
> > >trend in that direction
> > ><https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
> > >
> > >> I'm conscious that for many users this responsibility is instead
> > >> delegated to a catalog that maintains its own index structures and
> > >> statistics, only relies on the parquet metadata for very late stage
> > >> pruning, and may therefore see limited benefit from revisiting the
> > >> parquet metadata structures.
> > >
> > >
> > >This is exactly what Iceberg offers; it provides additional metadata to
> > >speed up the planning process:
> > >https://iceberg.apache.org/docs/nightly/performance/
> > >
> > >Kind regards,
> > >Fokko
> > >
> > >On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
> > ><r.taylordav...@googlemail.com.invalid> wrote:
> > >
> > >> Hi All,
> > >>
> > >> The recent discussions about metadata make me wonder where a storage
> > >> format ends and a database begins, as people seem to have differing
> > >> expectations of parquet here. In particular, one school of thought
> > >> posits that parquet should suffice as a standalone technology, where
> > >> users can write parquet files to a store and efficiently query them
> > >> directly with no additional technologies. However, others instead view
> > >> parquet as a storage format for use in conjunction with some sort of
> > >> catalog / metastore. These two approaches naturally place very
> > >> different demands on the parquet format. The former case incentivizes
> > >> constructing extremely large parquet files, potentially on the order
> > >> of TBs [1], such that the parquet metadata alone can efficiently be
> > >> used to service a query without lots of random I/O to separate files.
> > >> However, the latter case incentivizes relatively small parquet files
> > >> (< 1GB) laid out in such a way that the catalog metadata can be used
> > >> to efficiently identify a much smaller set of files for a given
> > >> query, and write amplification can be avoided for inserts.
> > >>
> > >> Having only ever used parquet in the context of data lake style
> > >> systems, the catalog approach comes more naturally to me and plays to
> > >> parquet's current strengths; however, this does not seem to be a
> > >> universally held expectation. I've frequently found people surprised
> > >> when queries performed in the absence of a catalog are slow, or
> > >> wishing to efficiently mutate or append to parquet files in place
> > >> [2] [3] [4]. It is possibly anecdotal, but these expectations seem to
> > >> be more common where people are coming from python-based tooling such
> > >> as pandas, and might reflect weaker tooling support for catalog
> > >> systems in this ecosystem.
> > >>
> > >> Regardless, this mismatch appears to be at the core of at least some
> > >> of the discussions about metadata. I do not think it a controversial
> > >> take that the current metadata structures are simply not set up for
> > >> files on the order of >1TB, where the metadata balloons to 10s or
> > >> 100s of MB and takes 10s of milliseconds just to parse. If this is in
> > >> scope, it would justify major changes to the parquet metadata;
> > >> however, I'm conscious that for many users this responsibility is
> > >> instead delegated to a catalog that maintains its own index
> > >> structures and statistics, only relies on the parquet metadata for
> > >> very late stage pruning, and may therefore see limited benefit from
> > >> revisiting the parquet metadata structures.
> > >>
> > >> I'd be very interested to hear other people's thoughts on this.
> > >>
> > >> Kind Regards,
> > >>
> > >> Raphael
> > >>
> > >> [1]: https://github.com/apache/arrow-rs/issues/5770
> > >> [2]: https://github.com/apache/datafusion/issues/9654
> > >> [3]:
> > >> https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> > >> [4]: https://github.com/apache/arrow-rs/issues/557
> > >>
> > >>
> >
>
