Interesting question here.

TPC benchmarks do give different numbers for different file sizes,
independent of the nominal TPC scale (e.g. different values for the 10TB
numbers, with everything else the same).

I know it's all so dependent on cluster, app etc., but what sizes do people
use in (a) benchmarks and (b) production datasets? Or at least: what
minimum sizes show up as very inefficient, and what large sizes seem to show
no incremental benefit?

The minimum size is going to be significant for distributed engines like
Spark, because of the cost of setting up each piece of work. Using cloud
storage as the data lake matters as well: there's overhead in simply
opening files and reading footers, which penalises small files. Parquet
through DuckDB is inevitably going to behave very differently.
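
To put a rough number on the footer part of that overhead, here is a
minimal sketch using only the Python standard library. It relies on the
Parquet file layout, where a file ends with a 4-byte little-endian footer
length followed by the magic bytes "PAR1". The function name
footer_overhead is my own, and the sketch assumes a local file with an
unencrypted footer; it says nothing about object-store request latency,
which would need measuring separately.

```python
import os
import struct

# Trailing magic of a Parquet file with a plaintext footer.
# (Files with an encrypted footer end in b"PARE" and are rejected here.)
PARQUET_MAGIC = b"PAR1"


def footer_overhead(path):
    """Return (file_size, footer_size, footer_fraction) for a Parquet file.

    footer_size counts the serialized file metadata plus the 8 trailing
    bytes (4-byte length + 4-byte magic) a reader fetches before it can
    plan any data reads.
    """
    file_size = os.path.getsize(path)
    if file_size < 12:
        raise ValueError(f"{path}: too small to be a Parquet file")

    with open(path, "rb") as f:
        # The last 8 bytes are: footer length (uint32, little-endian) + magic.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)

    footer_len, magic = struct.unpack("<I4s", tail)
    if magic != PARQUET_MAGIC:
        raise ValueError(f"{path}: missing PAR1 trailer")

    footer_size = footer_len + 8
    return file_size, footer_size, footer_size / file_size
```

Running this over a directory of files of different sizes would show how
the footer fraction shrinks as files grow, which is one input into picking
a minimum size.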

Papers with empirical data welcome.


Steve


On Wed, 28 May 2025 at 17:52, Ashish Singh <[email protected]>
wrote:

> Thanks all!
>
> Yea, I am mostly looking at available tooling to tune parquet files.
>
> Ed, I would be interested to discuss this. Would you (or anyone else) like
> to have a dedicated discussion on this? To provide some context, at
> Pinterest we are actively looking into adopting/ building such tooling. We,
> like others, have been traditionally relying on manual tuning so far, which
> isn't really scalable.
>
> Best Regards,
> Ashish
>
>
> On Wed, May 28, 2025 at 9:29 AM Ed Seidl <[email protected]> wrote:
>
> > I'm developing such a tool for my own use. Right now it only optimizes
> > for size, but I'm planning to add query time later. I'm trying to get
> > it open sourced, but the wheels of bureaucracy turn slowly :(
> >
> > Ed
> >
> > On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > > I think Ashish's question was about determining the right
> > > configuration in the first place - IIUC parquet-rewrite requires the
> > > user to pass these in.
> > >
> > > I'm not aware of any tool to choose good Parquet configurations
> > > automatically. I sometimes use the parquet-tools pip package / CLI to
> > > inspect Parquet and see how files are configured, but I've only tuned
> > > manually.
> > >
> > > On Tue, May 27, 2025, 16:22 Andrew Lamb <[email protected]> wrote:
> > >
> > > > We have one in the arrow-rs repository: parquet-rewrite[1]
> > > >
> > > > [1]: https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > > >
> > > > On Tue, May 27, 2025 at 12:41 PM Ashish Singh <[email protected]> wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > Is there any tool/ lib folks use to tune parquet configs to
> > > > > optimize for storage size / read/ write speed?
> > > > >
> > > > > - Ashish
> > > > >
> > > >
> > >
> >
>
