No objections have been received in a week, so I will merge the PR mentioned
below to enable size statistics by default in parquet-cpp.

Thanks all!

On Wed, Jan 15, 2025 at 5:06 AM Andrew Lamb <[email protected]> wrote:

> I believe DuckDB has its own custom Parquet implementation [1].
>
> [1]: https://github.com/duckdb/duckdb/blob/26cb7178fd89f924a936874e5c09ec1f6df8a0a4/extension/parquet/parquet_extension.cpp#L88
>
> On Tue, Jan 14, 2025 at 3:11 PM Steve Loughran <[email protected]> wrote:
>
> > Is this the library used by DuckDB? I've heard that it doesn't add
> > statistics to Parquet files, which is unfortunate.
> >
> > On Tue, 14 Jan 2025 at 15:13, Andrew Lamb <[email protected]> wrote:
> >
> > > I believe Ed added these statistics into parquet-rs[1] as well. We have
> > > also enabled them by default and haven't seen any performance issues.
> > >
> > > Andrew
> > >
> > > [1] https://github.com/apache/arrow-rs/pull/6105
> > >
> > > On Tue, Jan 14, 2025 at 9:38 AM Gang Wu <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > The C++ Parquet implementation in Apache Arrow (parquet-cpp) has
> > > > supported the Page Index since 13.0.0, and SizeStatistics support was
> > > > recently added in 19.0.0. Both features are disabled by default. We ran
> > > > a benchmark, and the results show that we can enable them by default
> > > > with an acceptable penalty. I have therefore opened a PR [1] to turn
> > > > them on by default. The benchmark results are also available in the PR.
> > > > Any feedback is welcome. If there are no objections, we will merge this
> > > > PR and release it with Apache Arrow 20.0.0.
> > > >
> > > > [1] https://github.com/apache/arrow/pull/45249
> > > >
> > > > Best,
> > > > Gang
> > > >
> > >
> >
>
