One key aspect is that data lake query engines often see data for the
first time when it is queried; I remember one of the Dremel/BigTable
papers discussing this. Stats in the footer mean an engine can
immediately do effective query planning without any import phase.
Without them, it has to collect stats by
1. scanning the tables and collecting stats, either on initial import or
as a background job, or
2. building up the stats as the initial queries are executed and saving
the results ("Seamless Integration of Parquet Files into Data
Processing", 2023,
<https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b0e-9e3f-94f39a9e7090/content>).
These approaches can work around parquet files written without stats
(those generated by apps other than Spark, in the paper), and they can
also build up extra information about selectivity, frequency etc., so
they can improve performance further. But min/max values are so
foundational for queries across timestamps of log files that everything
should have them for the best out-of-the-box performance.
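To make the min/max point concrete, here is a minimal sketch (not from
the thread) of footer-only row-group pruning for a time-range query. It
assumes PyArrow, a flat schema, and a hypothetical logs.parquet with a
tz-naive "timestamp" column; only the footer is parsed, and no data
pages are touched for the pruned row groups:

    from datetime import datetime
    import pyarrow.parquet as pq

    # Hypothetical file and column names -- adjust to your own data.
    PATH = "logs.parquet"
    TS_COLUMN = "timestamp"          # assumed tz-naive timestamp column
    QUERY_MIN = datetime(2024, 5, 1)
    QUERY_MAX = datetime(2024, 5, 2)

    pf = pq.ParquetFile(PATH)
    # For a flat schema the arrow field index matches the parquet column index.
    col_index = pf.schema_arrow.get_field_index(TS_COLUMN)

    keep = []
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(col_index).statistics
        if stats is None or not stats.has_min_max:
            # No stats were written: the row group must be read to be safe.
            keep.append(rg)
        elif stats.max >= QUERY_MIN and stats.min <= QUERY_MAX:
            # The row group's [min, max] overlaps the queried time range.
            keep.append(rg)
        # Otherwise the whole row group is pruned from the footer alone.

    table = pf.read_row_groups(keep, columns=[TS_COLUMN]) if keep else None

A file written without statistics falls into the first branch every
time, which is exactly the scan-everything-or-learn-as-you-go situation
described above.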
On Fri, 24 May 2024 at 04:47, Julien Le Dem <[email protected]> wrote:
> I would agree it's a bit of both. The metadata overhead (per data volume)
> doesn't increase when you have fewer files.
> That being said, you could use fewer of the metadata features in that use
> case if the goal is to exchange well formed data without ambiguity.
> For wide schema it would be useful to not have to read metadata for columns
> you are not reading.
>
> On Wed, May 22, 2024 at 9:26 AM Rok Mihevc <[email protected]> wrote:
>
> > I have worked in small data science/engineering teams where time to do
> > engineering is often a luxury and ad hoc data transformations and
> analysis
> > are the norm. In such environments a format that requires a catalog for
> > efficient reads will be less effective than one that comes with batteries
> > and good defaults included.
> >
> > Aside: a nice view into ad hoc parquet workloads in the wild is the
> > kaggle forums [1].
> >
> > [1] https://www.kaggle.com/search?q=parquet
> >
> > Rok
> >
> > On Wed, May 22, 2024 at 12:43 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > From my perspective I think the answer is more or less both. Even with
> > > only the data lake use-case we see a wide variety of files, some of
> > > which people would consider to be pushing reasonable boundaries. To
> > > some extent these might be solvable by having libraries ship better
> > > defaults (e.g. only collecting/writing statistics by default for the
> > > first N columns).
> > >
> > >
> > >
> > > On Tue, May 21, 2024 at 12:56 PM Steve Loughran
> > > <[email protected]>
> > > wrote:
> > >
> > > > I wish people would use avro over CSV. Not just for the schema or
> > > > more complex structures, but because the parser recognises corrupt
> > > > files. Oh, and the well-defined serialization formats for things
> > > > like "string" and "number".
> > > >
> > > > That said, I generate CSV in test/utility code because it is
> > > > trivial to do and then feed straight into a spreadsheet; I'm not
> > > > trying to use it for interchange.
> > > >
> > > > On Sat, 18 May 2024 at 17:10, Curt Hagenlocher <[email protected]
> >
> > > > wrote:
> > > >
> > > > > While CSV is still the undisputed monarch of exchanging data via
> > files,
> > > > > Parquet is arguably "top 3" -- and this is a scenario in which the
> > file
> > > > > does really need to be self-contained.
> > > > >
> > > > > On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Fokko,
> > > > > >
> > > > > > I am aware of catalogs such as iceberg, my question was if in the
> > > > design
> > > > > > of parquet we can assume the existence of such a catalog.
> > > > > >
> > > > > > Kind Regards,
> > > > > >
> > > > > > Raphael
> > > > > >
> > > > > > On 18 May 2024 16:18:22 BST, Fokko Driesprong <[email protected]>
> > > > wrote:
> > > > > > >Hey Raphael,
> > > > > > >
> > > > > > >Thanks for reaching out here. Have you looked into table formats
> > > such
> > > > as
> > > > > > Apache
> > > > > > >Iceberg <https://iceberg.apache.org/docs/nightly/>? This seems
> to
> > > fix
> > > > > the
> > > > > > >problem that you're describing
> > > > > > >
> > > > > > >A table format adds an ACID layer to the file format and acts
> as a
> > > > fully
> > > > > > >functional database. In the case of Iceberg, a catalog is
> required
> > > for
> > > > > > >atomicity, and alternatives like Delta Lake also seem to trend in
> > > > > > >that direction
> > > > > > ><https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
> > > > > > >
> > > > > > >I'm conscious that for many users this responsibility is instead
> > > > > delegated
> > > > > > >> to a catalog that maintains its own index structures and
> > > statistics,
> > > > > > only relies
> > > > > > >> on the parquet metadata for very late stage pruning, and may
> > > > therefore
> > > > > > >> see limited benefit from revisiting the parquet metadata
> > > structures.
> > > > > > >
> > > > > > >
> > > > > > >This is exactly what Iceberg offers, it provides additional
> > metadata
> > > > to
> > > > > > >speed up the planning process:
> > > > > > >https://iceberg.apache.org/docs/nightly/performance/
> > > > > > >
> > > > > > >Kind regards,
> > > > > > >Fokko
> > > > > > >
> > > > > > >Op za 18 mei 2024 om 16:40 schreef Raphael Taylor-Davies
> > > > > > ><[email protected]>:
> > > > > > >
> > > > > > >> Hi All,
> > > > > > >>
> > > > > > >> The recent discussions about metadata make me wonder where a
> > > storage
> > > > > > >> format ends and a database begins, as people seem to have
> > > differing
> > > > > > >> expectations of parquet here. In particular, one school of
> > thought
> > > > > > >> posits that parquet should suffice as a standalone technology,
> > > where
> > > > > > >> users can write parquet files to a store and efficiently query
> > > them
> > > > > > >> directly with no additional technologies. However, others
> > instead
> > > > view
> > > > > > >> parquet as a storage format for use in conjunction with some
> > sort
> > > of
> > > > > > >> catalog / metastore. These two approaches naturally place very
> > > > > different
> > > > > > >> demands on the parquet format. The former case incentivizes
> > > > > constructing
> > > > > > >> extremely large parquet files, potentially on the order of TBs
> > > [1],
> > > > > such
> > > > > > >> that the parquet metadata alone can efficiently be used to
> > > service a
> > > > > > >> query without lots of random I/O to separate files. However,
> the
> > > > > latter
> > > > > > >> case incentivizes relatively small parquet files (< 1GB) laid
> > out
> > > in
> > > > > > >> such a way that the catalog metadata can be used to
> efficiently
> > > > > identify
> > > > > > >> a much smaller set of files for a given query, and write
> > > > amplification
> > > > > > >> can be avoided for inserts.
> > > > > > >>
> > > > > > >> Having only ever used parquet in the context of data lake
> style
> > > > > systems,
> > > > > > >> the catalog approach comes more naturally to me and plays to
> > > > parquet's
> > > > > > >> current strengths, however, this does not seem to be a
> > universally
> > > > > held
> > > > > > >> expectation. I've frequently found people surprised when
> queries
> > > > > > >> performed in the absence of a catalog are slow, or who wish to
> > > > > > >> efficiently mutate or append to parquet files in place [2] [3]
> > > [4].
> > > > It
> > > > > > >> is possibly anecdotal but these expectations seem to be more
> > > common
> > > > > > >> where people are coming from python-based tooling such as
> > pandas,
> > > > and
> > > > > > >> might reflect weaker tooling support for catalog systems in
> this
> > > > > > ecosystem.
> > > > > > >>
> > > > > > >> Regardless this mismatch appears to be at the core of at least
> > > some
> > > > of
> > > > > > >> the discussions about metadata. I do not think it a
> > controversial
> > > > take
> > > > > > >> that the current metadata structures are simply not set up for
> > > files
> > > > on
> > > > > > >> the order of >1TB, where the metadata balloons to 10s or 100s
> of
> > > MB
> > > > > and
> > > > > > >> takes 10s of milliseconds just to parse. If this is in scope
> it
> > > > would
> > > > > > >> justify major changes to the parquet metadata, however, I'm
> > > > conscious
> > > > > > >> that for many users this responsibility is instead delegated
> to
> > a
> > > > > > >> catalog that maintains its own index structures and
> statistics,
> > > only
> > > > > > >> relies on the parquet metadata for very late stage pruning,
> and
> > > may
> > > > > > >> therefore see limited benefit from revisiting the parquet
> > metadata
> > > > > > >> structures.
> > > > > > >>
> > > > > > >> I'd be very interested to hear other people's thoughts on
> this.
> > > > > > >>
> > > > > > >> Kind Regards,
> > > > > > >>
> > > > > > >> Raphael
> > > > > > >>
> > > > > > >> [1]: https://github.com/apache/arrow-rs/issues/5770
> > > > > > >> [2]: https://github.com/apache/datafusion/issues/9654
> > > > > > >> [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> > > > > > >> [4]: https://github.com/apache/arrow-rs/issues/557