I’ve often seen 100MB cited as a “reasonable” default choice, but I don’t have 
a lot of data to substantiate that. On our system we’ve found that much smaller 
files (e.g. 5MB) incur too much per-file overhead, while much larger ones (e.g. 
5GB) lead to OOMs and to a lot of time spent parsing footers/stats even when 
you only need to read a couple of rows.
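
For what it’s worth, the kind of audit we run looks roughly like the sketch 
below (pyarrow-based; the 100MB–1GB band and the report_sizes helper are 
illustrative assumptions, not our actual tooling):

    # Flag Parquet files whose on-disk size falls outside a target band.
    # The thresholds here are assumptions for illustration.
    import os
    import pyarrow.parquet as pq

    def report_sizes(paths, min_bytes=100 * 2**20, max_bytes=2**30):
        for path in paths:
            size = os.path.getsize(path)
            # Only the footer is parsed here, not the data pages.
            meta = pq.ParquetFile(path).metadata
            flag = ("small" if size < min_bytes
                    else "large" if size > max_bytes else "ok")
            print(f"{path}: {size / 2**20:.1f} MiB, "
                  f"{meta.num_row_groups} row groups, "
                  f"{meta.num_rows} rows [{flag}]")

Files flagged "small" or "large" then get rewritten toward the middle of the 
band.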

> On May 28, 2025, at 12:23 PM, Steve Loughran <[email protected]> 
> wrote:
> 
> Interesting q here.
> 
> TPC benchmarks do give different numbers for different file sizes,
> independent of the nominal TPC scale (e.g. different results for the 10TB
> runs, with everything else the same).
> 
> I know it's all so dependent on cluster, app, etc., but what sizes do people
> use in (a) benchmarks and (b) production datasets? Or at least: what
> minimum sizes show up as very inefficient, and what large sizes seem to show
> no incremental benefit?
> 
> The minimum size is going to be especially significant for distributed
> engines like Spark, given the per-task setup costs, but also when using
> cloud storage as the data lake: there's overhead in simply opening files and
> reading footers, which penalises small files. Parquet through DuckDB is
> inevitably going to be very different.
> 
> Papers with empirical data welcome.
> 
> 
> Steve
> 
> 
> On Wed, 28 May 2025 at 17:52, Ashish Singh <[email protected]>
> wrote:
> 
>> Thanks all!
>> 
>> Yea, I am mostly looking at available tooling to tune Parquet files.
>> 
>> Ed, I would be interested to discuss this. Would you (or anyone else) like
>> to have a dedicated discussion on this? To provide some context, at
>> Pinterest we are actively looking into adopting or building such tooling.
>> Like others, we have traditionally relied on manual tuning, which isn't
>> really scalable.
>> 
>> Best Regards,
>> Ashish
>> 
>> 
>> On Wed, May 28, 2025 at 9:29 AM Ed Seidl <[email protected]> wrote:
>> 
>>> I'm developing such a tool for my own use. Right now it only optimizes
>>> for size, but I'm planning to add query time later. I'm trying to get it
>>> open sourced, but the wheels of bureaucracy turn slowly :(
>>> 
>>> Ed
>>> 
>>> On 2025/05/28 15:36:37 Martin Loncaric wrote:
>>>> I think Ashish's question was about determining the right configuration
>>>> in the first place - IIUC parquet-rewrite requires the user to pass
>>>> these in.
>>>> 
>>>> I'm not aware of any tool to choose good Parquet configurations
>>>> automatically. I sometimes use the parquet-tools pip package / CLI to
>>>> inspect Parquet and see how files are configured, but I've only tuned
>>>> manually.
>>>> 
>>>> On Tue, May 27, 2025, 16:22 Andrew Lamb <[email protected]> wrote:
>>>> 
>>>>> We have one in the arrow-rs repository: parquet-rewrite[1]
>>>>> 
>>>>> [1]:
>>>>> https://github.com/apache/arrow-rs/blob/0da003becbd6489f483b70e5914a242edd8c6d1a/parquet/src/bin/parquet-rewrite.rs#L18
>>>>> 
>>>>> On Tue, May 27, 2025 at 12:41 PM Ashish Singh <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hey all,
>>>>>> 
>>>>>> Is there any tool/lib folks use to tune Parquet configs to optimize
>>>>>> for storage size / read/write speed?
>>>>>> 
>>>>>> - Ashish
