I have worked in small data science/engineering teams where time for engineering is often a luxury and ad hoc data transformations and analyses are the norm. In such environments, a format that requires a catalog for efficient reads will be less effective than one that comes with batteries included and good defaults.
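Concretely, the reads that have to stay cheap here are the ones served entirely from the file footer. A minimal sketch with pyarrow (the file name is a hypothetical stand-in):

    import pyarrow.parquet as pq

    # "events.parquet" stands in for any standalone file in a store.
    f = pq.ParquetFile("events.parquet")

    # The footer alone carries the schema, row-group layout and per-column
    # statistics -- enough for pruning without any external catalog.
    meta = f.metadata
    print(meta.num_row_groups, meta.serialized_size)

    stats = meta.row_group(0).column(0).statistics
    if stats is not None and stats.has_min_max:
        print(stats.min, stats.max)  # min/max drive row-group skipping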
Aside: a nice view into ad hoc parquet workloads in the wild is the Kaggle forums [1]. (On Micah's point about better defaults, a rough sketch of the existing statistics knob follows below the quoted thread.)

[1] https://www.kaggle.com/search?q=parquet

Rok

On Wed, May 22, 2024 at 12:43 AM Micah Kornfield <[email protected]> wrote:

> From my perspective I think the answer is more or less both. Even with
> only the data lake use-case we see a wide variety of files that people
> would consider to be pushing reasonable boundaries. To some extent these
> might be solvable by libraries having better defaults (e.g. only
> collecting/writing statistics by default for the first N columns).
>
> On Tue, May 21, 2024 at 12:56 PM Steve Loughran <[email protected]> wrote:
>
> > I wish people would use Avro over CSV. Not just for the schema or more
> > complex structures, but because the parser recognises corrupt files. Oh,
> > and the well-defined serialization formats for things like "string" and
> > "number".
> >
> > That said, I generate CSV in test/utility code because it is trivial to
> > do and then feed straight into a spreadsheet -- I'm not trying to use it
> > for interchange.
> >
> > On Sat, 18 May 2024 at 17:10, Curt Hagenlocher <[email protected]> wrote:
> >
> > > While CSV is still the undisputed monarch of exchanging data via files,
> > > Parquet is arguably "top 3" -- and this is a scenario in which the file
> > > does really need to be self-contained.
> > >
> > > On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
> > > <[email protected]> wrote:
> > >
> > > > Hi Fokko,
> > > >
> > > > I am aware of catalogs such as Iceberg; my question was whether, in
> > > > the design of parquet, we can assume the existence of such a catalog.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael
> > > >
> > > > On 18 May 2024 16:18:22 BST, Fokko Driesprong <[email protected]> wrote:
> > > >
> > > > > Hey Raphael,
> > > > >
> > > > > Thanks for reaching out here. Have you looked into table formats
> > > > > such as Apache Iceberg <https://iceberg.apache.org/docs/nightly/>?
> > > > > This seems to fix the problem that you're describing.
> > > > >
> > > > > A table format adds an ACID layer to the file format and acts as a
> > > > > fully functional database. In the case of Iceberg, a catalog is
> > > > > required for atomicity, and alternatives like Delta Lake also seem
> > > > > to trend in that direction
> > > > > <https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
> > > > >
> > > > > > I'm conscious that for many users this responsibility is instead
> > > > > > delegated to a catalog that maintains its own index structures and
> > > > > > statistics, only relies on the parquet metadata for very late stage
> > > > > > pruning, and may therefore see limited benefit from revisiting the
> > > > > > parquet metadata structures.
> > > > >
> > > > > This is exactly what Iceberg offers; it provides additional metadata
> > > > > to speed up the planning process:
> > > > > https://iceberg.apache.org/docs/nightly/performance/
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > > On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > The recent discussions about metadata make me wonder where a
> > > > > > storage format ends and a database begins, as people seem to have
> > > > > > differing expectations of parquet here. In particular, one school
> > > > > > of thought posits that parquet should suffice as a standalone
> > > > > > technology, where users can write parquet files to a store and
> > > > > > efficiently query them directly with no additional technologies.
> > > > > > However, others instead view parquet as a storage format for use
> > > > > > in conjunction with some sort of catalog / metastore. These two
> > > > > > approaches naturally place very different demands on the parquet
> > > > > > format. The former case incentivizes constructing extremely large
> > > > > > parquet files, potentially on the order of TBs [1], such that the
> > > > > > parquet metadata alone can efficiently be used to service a query
> > > > > > without lots of random I/O to separate files. However, the latter
> > > > > > case incentivizes relatively small parquet files (< 1GB) laid out
> > > > > > in such a way that the catalog metadata can be used to efficiently
> > > > > > identify a much smaller set of files for a given query, and write
> > > > > > amplification can be avoided for inserts.
> > > > > >
> > > > > > Having only ever used parquet in the context of data lake style
> > > > > > systems, the catalog approach comes more naturally to me and plays
> > > > > > to parquet's current strengths; however, this does not seem to be
> > > > > > a universally held expectation. I've frequently found people
> > > > > > surprised when queries performed in the absence of a catalog are
> > > > > > slow, or who wish to efficiently mutate or append to parquet files
> > > > > > in place [2] [3] [4]. It is possibly anecdotal, but these
> > > > > > expectations seem to be more common where people are coming from
> > > > > > python-based tooling such as pandas, and might reflect weaker
> > > > > > tooling support for catalog systems in this ecosystem.
> > > > > >
> > > > > > Regardless, this mismatch appears to be at the core of at least
> > > > > > some of the discussions about metadata. I do not think it a
> > > > > > controversial take that the current metadata structures are simply
> > > > > > not set up for files on the order of >1TB, where the metadata
> > > > > > balloons to 10s or 100s of MB and takes 10s of milliseconds just
> > > > > > to parse. If this is in scope it would justify major changes to
> > > > > > the parquet metadata; however, I'm conscious that for many users
> > > > > > this responsibility is instead delegated to a catalog that
> > > > > > maintains its own index structures and statistics, only relies on
> > > > > > the parquet metadata for very late stage pruning, and may
> > > > > > therefore see limited benefit from revisiting the parquet metadata
> > > > > > structures.
> > > > > >
> > > > > > I'd be very interested to hear other people's thoughts on this.
> > > > > >
> > > > > > Kind Regards,
> > > > > >
> > > > > > Raphael
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow-rs/issues/5770
> > > > > > [2]: https://github.com/apache/datafusion/issues/9654
> > > > > > [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> > > > > > [4]: https://github.com/apache/arrow-rs/issues/557
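P.S. The statistics-defaults idea Micah raises above, sketched with pyarrow (column names and path are hypothetical; other writers expose similar options):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": [1, 2, 3],
        "ts": [10, 20, 30],
        "payload": ["a", "b", "c"],
    })

    # Collect/write statistics only for the leading, frequently-filtered
    # columns; very wide tables then stop paying footer overhead for
    # hundreds of columns that are never used for pruning.
    pq.write_table(
        table,
        "example.parquet",  # hypothetical path
        write_statistics=["id", "ts"],
    )

A default along the "first N columns" lines would just need writers to pick that list automatically.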
