I'm curious about this as well. I've made some attempts at write
benchmarking, but the challenge is that the "optimal" configuration is so
dependent on how you intend to read the data... for example, we used to
recommend a 512MB block size as a reasonable default, which worked well for
wide schemas that were always read with tiny projections, but not so well
for narrow schemas intended to be read in their entirety. Same with the
page size param - bumping it above the default improves compression, but
depending on the distribution of column values, statistics filtering
degrades.
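
To make that concrete, here's roughly what those knobs look like through
pyarrow (parameter names from pyarrow.parquet.write_table; note pyarrow
counts row groups in rows rather than bytes, and the values below are
illustrative, not recommendations):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(100_000)),
                      "payload": ["x" * 100] * 100_000})

    # Bigger row groups amortize per-group metadata but hurt readers that
    # project only a few columns; bigger data pages compress better but
    # coarsen page-level statistics pruning.
    pq.write_table(
        table,
        "tuned.parquet",
        row_group_size=1_000_000,            # rows per row group
        data_page_size=1024 * 1024,          # target ~1 MiB data pages
        dictionary_pagesize_limit=2 * 1024 * 1024,  # max dictionary size
    )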

A lot of the time it ends up being a tradeoff between saving money on
storage, or on downstream processing costs (and as Steve mentioned, even
that varies by processing engine).

It could be helpful to publish some kind of qualitative tuning guide
somewhere in the Parquet docs, since I feel like I've mostly
learned through trial and error, and reading through parquet-java internals
:)

Claire

On Wed, May 28, 2025 at 8:40 PM Ashish Singh <[email protected]> wrote:

> > FWIW the tool is Python, so I use pyarrow when generating numbers. I
> > haven't yet tested to see how well the results translate to other writers.
>
> Would be curious about this too.
>
> On Wed, May 28, 2025 at 12:28 PM Ed Seidl <[email protected]> wrote:
>
> > Yes, right now we're targeting pyarrow and parquet-cpp, but will add
> > parquet-rs soon too. We haven't used parquet-java for quite a while, so
> > I've lost track of the possible configs there.
> >
> > All columns get PLAIN and DICTIONARY encoding, and then I'll add in other
> > encodings based on the physical type of the column. Other than that, there
> > are no other heuristics, but there are command-line flags to limit the
> > test space (you can select only certain columns and cut down on
> > compression codecs, for instance).
> >
> > FWIW the tool is Python, so I use pyarrow when generating numbers. I
> > haven't yet tested to see how well the results translate to other writers.
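> >
> > In rough pseudocode, the core loop is something like the sketch below
> > (the column name and codec list are just illustrative; the real sweep
> > also covers encodings and dictionary size limits):
> >
> >     import itertools
> >     import os
> >
> >     import pyarrow.parquet as pq
> >
> >     table = pq.read_table("input.parquet")
> >     results = []
> >     # Try each (codec, dictionary on/off) pair for one column and keep
> >     # whichever combination yields the smallest file.
> >     for codec, use_dict in itertools.product(
> >         ["none", "snappy", "gzip", "zstd"], [True, False]
> >     ):
> >         out = f"trial-{codec}-{use_dict}.parquet"
> >         pq.write_table(
> >             table,
> >             out,
> >             compression={"col_a": codec},  # per-column codec
> >             use_dictionary=["col_a"] if use_dict else False,
> >         )
> >         results.append((os.path.getsize(out), codec, use_dict))
> >     print(min(results))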
> >
> > On 2025/05/28 19:13:41 Ashish Singh wrote:
> > > > What my tool does is, for a given input parquet file and for each
> > > > column, cycle through all combinations of column encoding, column
> > > > compression, and max dictionary size. When it's done, the optimal
> > > > settings (to minimize file size) for those are given for each column,
> > > > along with code snippets to set them (either pyarrow or parquet-cpp at
> > > > the moment).
> > >
> > > Thanks Ed. Do you cycle through all possible configs for an input file,
> > > or do you also use some heuristics to narrow the search space? Per-column
> > > compression tuning seems not to be achievable in parquet-java currently,
> > > so it sounds like your use case is primarily pyarrow and parquet-cpp?
> > >
> > >
> > > On Wed, May 28, 2025 at 11:37 AM Ed Seidl <[email protected]> wrote:
> > >
> > > > What my tool does is, for a given input parquet file and for each
> > > > column, cycle through all combinations of column encoding, column
> > > > compression, and max dictionary size. When it's done, the optimal
> > > > settings (to minimize file size) for those are given for each column,
> > > > along with code snippets to set them (either pyarrow or parquet-cpp at
> > > > the moment).
> > > >
> > > > In the past I've done a little tuning work on row group/page size for
> > > > point lookup on HDFS, but that was all manual.
> > > >
> > > > Ed
> > > >
> > > > On 2025/05/28 17:58:38 Ashish Singh wrote:
> > > > > We typically aim at 800 MB file sizes for object stores. However, we
> > > > > are not interested in changing file content or size as part of the
> > > > > Parquet tuning. We simply want to optimize the internal layout of the
> > > > > file for a particular resource, like storage size, read speed, or
> > > > > write speed.
> > > > >
> > > > >
> > > > > On Wed, May 28, 2025 at 10:27 AM Adrian Garcia Badaracco <[email protected]> wrote:
> > > > >
> > > > > > I’ve often seen 100MB as a “reasonable” default choice, but I
> > > > > > don’t have a lot of data to substantiate that. On our system we’ve
> > > > > > found that smaller (e.g. 5MB) leads to too much overhead, and
> > > > > > larger (e.g. 5GB) leads to OOMs, too much overhead parsing footers
> > > > > > / stats even if you’re only going to read a couple of rows, etc.
> > > > > >
> > > > > > > On May 28, 2025, at 12:23 PM, Steve Loughran <[email protected]> wrote:
> > > > > > >
> > > > > > > Interesting question here.
> > > > > > >
> > > > > > > TPC benchmarks do give different numbers for different file
> > > > > > > sizes, independent of the nominal TPC scale (e.g. different
> > > > > > > values for the 10TB numbers, with everything else the same).
> > > > > > >
> > > > > > > I know it's all so dependent on cluster, app, etc., but what
> > > > > > > sizes do people use in (a) benchmarks and (b) production
> > > > > > > datasets? Or at least: what minimum sizes show up as very
> > > > > > > inefficient, and what large sizes seem to show no incremental
> > > > > > > benefit?
> > > > > > >
> > > > > > > The minimum size is going to be especially significant for
> > > > > > > distributed engines like Spark, as there are per-task setup
> > > > > > > costs, but so is using cloud storage as the data lake - there's
> > > > > > > overhead in simply opening files and reading footers, which will
> > > > > > > penalise small files. Parquet through DuckDB is inevitably going
> > > > > > > to be very different.
> > > > > > >
> > > > > > > Papers with empirical data welcome...
> > > > > > >
> > > > > > >
> > > > > > > Steve
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 28 May 2025 at 17:52, Ashish Singh <[email protected]> wrote:
> > > > > > >
> > > > > > >> Thanks all!
> > > > > > >>
> > > > > > >> Yea, I am mostly looking at available tooling to tune parquet
> > > > > > >> files.
> > > > > > >>
> > > > > > >> Ed, I would be interested to discuss this. Would you (or anyone
> > > > > > >> else) like to have a dedicated discussion on this? To provide
> > > > > > >> some context, at Pinterest we are actively looking into
> > > > > > >> adopting/building such tooling. We, like others, have
> > > > > > >> traditionally been relying on manual tuning so far, which isn't
> > > > > > >> really scalable.
> > > > > > >>
> > > > > > >> Best Regards,
> > > > > > >> Ashish
> > > > > > >>
> > > > > > >>
> > > > > > >> On Wed, May 28, 2025 at 9:29 AM Ed Seidl <[email protected]> wrote:
> > > > > > >>
> > > > > > >>> I'm developing such a tool for my own use. Right now it only
> > > > > > >>> optimizes for size, but I'm planning to add query time later.
> > > > > > >>> I'm trying to get it open sourced, but the wheels of
> > > > > > >>> bureaucracy turn slowly :(
> > > > > > >>>
> > > > > > >>> Ed
> > > > > > >>>
> > > > > > >>> On 2025/05/28 15:36:37 Martin Loncaric wrote:
> > > > > > >>>> I think Ashish's question was about determining the right
> > > > > > >>>> configuration in the first place - IIUC parquet-rewrite
> > > > > > >>>> requires the user to pass these in.
> > > > > > >>>>
> > > > > > >>>> I'm not aware of any tool to choose good Parquet
> > > > > > >>>> configurations automatically. I sometimes use the
> > > > > > >>>> parquet-tools pip package / CLI to inspect Parquet and see how
> > > > > > >>>> files are configured, but I've only tuned manually.
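> > > > > > >>>>
> > > > > > >>>> (You can get a similar read-out from pyarrow alone; a quick
> > > > > > >>>> sketch, with the file name made up:)
> > > > > > >>>>
> > > > > > >>>>     import pyarrow.parquet as pq
> > > > > > >>>>
> > > > > > >>>>     # Print the codec, encodings, and sizes of every column
> > > > > > >>>>     # chunk in every row group.
> > > > > > >>>>     md = pq.ParquetFile("data.parquet").metadata
> > > > > > >>>>     for rg in range(md.num_row_groups):
> > > > > > >>>>         for ci in range(md.num_columns):
> > > > > > >>>>             col = md.row_group(rg).column(ci)
> > > > > > >>>>             print(col.path_in_schema, col.compression,
> > > > > > >>>>                   col.encodings, col.total_compressed_size,
> > > > > > >>>>                   col.total_uncompressed_size)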
> > > > > > >>>>
> > > > > > >>>> On Tue, May 27, 2025, 16:22 Andrew Lamb <[email protected]> wrote:
> > > > > > >>>>
> > > > > > >>>>> We have one in the arrow-rs repository: parquet-rewrite[1]
> > > > > > >>>>>
> > > > > > >>>>> [1]:
> > > > > > >>>>> https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
> > > > > > >>>>>
> > > > > > >>>>> On Tue, May 27, 2025 at 12:41 PM Ashish Singh <[email protected]> wrote:
> > > > > > >>>>>
> > > > > > >>>>>> Hey all,
> > > > > > >>>>>>
> > > > > > >>>>>> Is there any tool/ lib folks use to tune parquet configs to
> > > > > > >>>>>> optimize for storage size / read / write speed?
> > > > > > >>>>>>
> > > > > > >>>>>> - Ashish
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
