w.r.t benchmarks, I'd look at "An Empirical Evaluation of Columnar Storage
Formats (Extended Version)", https://arxiv.org/pdf/2304.05028




On Tue, 26 Aug 2025 at 21:45, Nimrod Ofek <ofek.nim...@gmail.com> wrote:

> Hi,
>
> From my experience, and from all the benchmarks I have run and read, Snappy
> produces much larger files than ZSTD, while CPU usage is similar for both -
> in most cases the difference is not really noticeable.
>
> We switched to ZSTD and our CPU usage did not increase noticeably (maybe
> 1-2%, if at all), while file sizes dropped by ~35%.
> It depends on the data you compress and the hardware you use, so there is
> no real alternative to trial and error, but for us ZSTD saved a lot of
> money...
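>
> For reference, a minimal sketch of that switch (Scala; the write path is
> just a placeholder):
>
>     // Make zstd the session default for Parquet writes
>     spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
>
>     // Or override it for a single write
>     df.write.option("compression", "zstd").parquet("s3://bucket/table")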
>
> Indeed, most of the speed benefit will come from skipping data you don't
> need to read - and the best way to achieve that is not to use Parquet
> directly, but to use open table formats such as Iceberg and Delta. For
> instance, Delta gathers statistics on the columns most used for filtering,
> so files that are not relevant to your specific query can be skipped, on
> top of partition pruning, and you read much less data.
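>
> A rough sketch of that flow (Scala; assumes the delta-spark package is on
> the classpath, and the path and column names are made up):
>
>     // Partitioned Delta table; Delta collects per-file min/max statistics
>     df.write
>       .format("delta")
>       .partitionBy("event_date")
>       .save("s3://bucket/events")
>
>     // A filter on a stats-covered column lets Delta skip whole files,
>     // on top of partition pruning on event_date
>     spark.read.format("delta").load("s3://bucket/events")
>       .where("event_date = '2025-08-26' AND customer_id = 42")
>       .count()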
>
> HTH,
> Nimrod
>
> On Tue, Aug 26, 2025, 22:38 Nikolas Vanderhoof <
> nikolasrvanderh...@gmail.com> wrote:
>
>> Thank you for the detailed response. This is helpful. I’ll read your
>> article, and test my data as you’ve described.
>>
>> On Tue, Aug 26, 2025 at 3:05 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Nikolas,
>>>
>>> *Why Spark defaults to Snappy for Parquet.* In analytics scans the
>>> bottleneck is usually *CPU spent decompressing Parquet pages*, not raw I/O.
>>> Snappy gives *very fast decode* at a decent ratio, so end-to-end query
>>> latency is typically better than with heavier codecs like GZIP. For colder
>>> data, GZIP (or ZSTD) can make sense if you're chasing storage savings and
>>> can afford slower reads.
>>>
>>> Two different codec decisions to make:
>>>
>>> 1. Intermediates (shuffle/spill/broadcast) - speed > ratio.
>>>    I keep fast codecs here; changing them rarely helps unless the
>>>    network/disk is the bottleneck and I have spare CPU:
>>>
>>>    spark.conf.set("spark.shuffle.compress", "true")
>>>    spark.conf.set("spark.shuffle.spill.compress", "true")
>>>    spark.conf.set("spark.io.compression.codec", "lz4")  // snappy or zstd are also viable
>>>
>>> 2. Storage at rest (final Parquet in the lake/lakehouse) - pick by hot vs cold.
>>>    - *Hot / frequently scanned:* *Snappy* for fastest reads.
>>>    - *Cold / archival:* *GZIP* (or try *ZSTD*) for much smaller files;
>>>      accept slower scans.
>>>
>>>    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")  // or "gzip" or "zstd"
>>>
>>>
>>> This mirrors what I wrote up for *BigQuery external Parquet on object
>>> storage*, as attached (different engine, same storage trade-off): I used
>>> *Parquet + GZIP* when exporting to Cloud Storage (great size reduction) and
>>> noted that *external tables read slower than native*, so I keep hot data
>>> “native” and push colder tiers to cheaper storage with heavier compression.
>>> In that piece, a toy query ran ~*190 ms* on the native table vs ~*296 ms* on
>>> the external table (≈56% slower), which is the kind of latency gap you trade
>>> for cost/footprint savings on colder data.
>>>
>>> *Bigger levers than the codec*
>>> The codec choice matters, but *reading fewer bytes* matters more! In my
>>> article I lean heavily on *Hive-style partition layouts* for external
>>> Parquet (multiple partition keys, strict directory order), and call out
>>> gotchas like keeping *non-Parquet junk out of leaf directories* (external
>>> table creation and reads can fail or slow down if the layout is messy).
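>>>
>>> A minimal sketch of such a layout (Scala; bucket and column names are made
>>> up):
>>>
>>>    // Two partition keys, written in strict directory order:
>>>    //   gs://bucket/curated/events/event_date=2025-08-26/country=GB/part-*.parquet
>>>    df.write
>>>      .partitionBy("event_date", "country")
>>>      .option("compression", "gzip")   // heavier codec for a colder tier
>>>      .parquet("gs://bucket/curated/events")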
>>>
>>> How I would benchmark on your data
>>> Write the same dataset three ways (snappy, gzip, zstd), then measure (see
>>> the sketch after this list):
>>>
>>>    - total bytes on storage,
>>>    - Spark SQL *scan time* and *CPU time* in the UI,
>>>    - effect of *partition pruning* with realistic filters.
>>>
>>> Keep the shuffle settings fast (above) so you're testing scan costs, not an
>>> artificially slow shuffle.
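>>>
>>> A rough sketch of that loop (Scala; paths, the df variable and the filter
>>> column are placeholders):
>>>
>>>    // Write the same DataFrame once per codec
>>>    for (codec <- Seq("snappy", "gzip", "zstd")) {
>>>      df.write.option("compression", codec)
>>>        .mode("overwrite")
>>>        .parquet(s"s3://bucket/bench/events_$codec")
>>>    }
>>>
>>>    // Time an identical filtered scan per codec; cross-check with the Spark UI
>>>    for (codec <- Seq("snappy", "gzip", "zstd")) {
>>>      val t0 = System.nanoTime()
>>>      val n  = spark.read.parquet(s"s3://bucket/bench/events_$codec")
>>>        .where("event_date = '2025-08-26'").count()
>>>      println(f"$codec%-6s rows=$n elapsed=${(System.nanoTime() - t0) / 1e9}%.1f s")
>>>    }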
>>>
>>> My rules of thumb
>>>
>>>    - If *latency* and interactive work matter → *Snappy* Parquet.
>>>    - If *storage $$* dominates and reads are rare → *GZIP* (or *ZSTD* as a
>>>      middle ground).
>>>    - Regardless of codec, *partition pruning + sane file sizes* move the
>>>      needle the most (that's the core of my "Hybrid Curated Storage"
>>>      approach).
>>> HTH
>>>
>>> Regards
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>> (P.S. The background and examples I referenced are from my article on
>>> using *GCS external Parquet* with *Snappy/GZIP/ZSTD* and Hive
>>> partitioning for cost/perf balance—feel free to skim the compression/export
>>> and partitioning sections.)
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> On Tue, 26 Aug 2025 at 17:59, Nikolas Vanderhoof <
>>> nikolasrvanderh...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Why does Spark use Snappy by default when compressing data within
>>>> Parquet? I’ve read that when shuffling, speed is prioritized above
>>>> compression ratio. Is that true, and are there other things to consider?
>>>>
>>>> Also, are there any recent benchmarks that the community has performed
>>>> that evaluate the performance of Spark when using Snappy compared to other
>>>> codecs? I’d be interested not only in the impact when using other codecs
>>>> for the intermediate and shuffle files, but also for the storage at rest.
>>>> For example, I know there are different configuration options that allow me
>>>> to set the codec for these internal files, or for the final Parquet files
>>>> stored in the lakehouse.
>>>>
>>>> Before I decide to use a codec other than the default in my work, I
>>>> want to understand any tradeoffs better.
>>>>
>>>> Thanks,
>>>> Nik
>>>>
>>>
