W.r.t. benchmarks, I'd look at "An Empirical Evaluation of Columnar Storage Formats (Extended Version)", https://arxiv.org/pdf/2304.05028

On Tue, 26 Aug 2025 at 21:45, Nimrod Ofek <ofek.nim...@gmail.com> wrote:

Hi,

From my experience, and from all the benchmarks I've run and read, Snappy produces much larger files than ZSTD, while CPU usage is similar for both; in most cases the difference isn't really noticeable.

We switched to ZSTD and our CPU usage did not increase noticeably (maybe 1-2%, if at all), while file sizes dropped by ~35%. It depends on the data you compress and the hardware you use, so there is no real alternative to trial and error, but for us I can say ZSTD saved a lot of money...

The biggest speed gains, though, come from skipping data you don't need to read, and the best way to achieve that is not to use Parquet directly but to use an open table format such as Iceberg or Delta. In Delta, for instance, you can collect statistics on the columns you filter on most, so files that are irrelevant to a given query get skipped, on top of partition pruning, and you read much less data. A sketch of what that looks like follows below.

HTH,
Nimrod
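A minimal sketch of the Delta side of that (table, path, and column names are made up for illustration; assumes the Delta Lake package is on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("delta-file-skipping-sketch")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Create the table partitioned by a coarse key, and keep the columns you
    // filter on among the first N so Delta collects min/max stats for them.
    spark.sql("""
      CREATE TABLE events (event_date DATE, customer_id BIGINT, payload STRING)
      USING delta
      PARTITIONED BY (event_date)
      TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '4')
    """)

    // Load the raw data (hypothetical source path with a matching schema).
    spark.read.parquet("/raw/events")
      .write.format("delta").mode("append").saveAsTable("events")

    // A filter on a stats-covered column skips files that cannot match,
    // on top of the partition pruning on event_date.
    spark.table("events")
      .where("event_date = DATE'2025-08-01' AND customer_id = 42")
      .explain()   // check which partitions and how many files are actually read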
On Tue, 26 Aug 2025 at 22:38, Nikolas Vanderhoof <nikolasrvanderh...@gmail.com> wrote:

Thank you for the detailed response. This is helpful. I'll read your article and test my data as you've described.

On Tue, Aug 26, 2025 at 3:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi Nikolas,

Why Spark defaults to Snappy for Parquet: in analytic scans the bottleneck is usually the CPU spent decompressing Parquet pages, not raw I/O. Snappy gives very fast decode at a decent ratio, so end-to-end query latency is typically better than with heavier codecs like GZIP. For colder data, GZIP (or ZSTD) can make sense if you're chasing storage savings and can afford slower reads.

There are two different codec decisions to make.

1. Intermediates (shuffle/spill/broadcast): speed > ratio. I keep fast codecs here; changing them rarely helps unless the network/disk is the bottleneck and I have spare CPU:

    spark.conf.set("spark.shuffle.compress", "true")
    spark.conf.set("spark.shuffle.spill.compress", "true")
    spark.conf.set("spark.io.compression.codec", "lz4")  // snappy or zstd are also viable

(These are core Spark settings read at application launch, so in practice set them via spark-submit --conf or in SparkConf rather than from a running session.)

2. Storage at rest (final Parquet in the lake/lakehouse): pick by hot vs cold.
   - Hot / frequently scanned: Snappy for fastest reads.
   - Cold / archival: GZIP (or try ZSTD) for much smaller files; accept slower scans.

    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")  // or "gzip" or "zstd"

This mirrors what I wrote up for BigQuery external Parquet on object storage, as attached (different engine, same storage trade-off): I used Parquet + GZIP when exporting to Cloud Storage (great size reduction) and noted that external tables read slower than native, so I keep hot data "native" and push colder tiers to cheaper storage with heavier compression. In that piece, a toy query ran ~190 ms on native vs ~296 ms on the external table (about 55% slower), which is the kind of latency gap you trade for cost/footprint savings on colder data.

Bigger levers than the codec: the codec choice matters, but reading fewer bytes matters more! In my article I lean heavily on Hive-style partition layouts for external Parquet (multiple partition keys, strict directory order), and call out gotchas like keeping non-Parquet junk out of leaf directories (external table creation and reads can fail or slow down if the layout is messy).

How I would benchmark on your data: write the same dataset three ways (snappy, gzip, zstd), then measure:

   - total bytes on storage,
   - Spark SQL scan time and CPU time in the UI,
   - the effect of partition pruning with realistic filters.

Keep the shuffle settings fast (above) so you're testing scan costs, not an artificially slow shuffle.
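For concreteness, here is a rough sketch of that loop (run in spark-shell so the spark session is already in scope; the source path, partition column, and filter are placeholders):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val df = spark.read.parquet("/raw/events")   // the dataset you want to test with
    val codecs = Seq("snappy", "gzip", "zstd")

    // Write the same data once per codec, partitioned so the filter below can prune.
    codecs.foreach { codec =>
      spark.conf.set("spark.sql.parquet.compression.codec", codec)
      df.write.mode("overwrite").partitionBy("event_date").parquet(s"/bench/events_$codec")
    }

    // Compare bytes on storage and the wall-clock time of a realistic filtered scan;
    // the SQL tab in the Spark UI has the per-query scan and CPU time.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    codecs.foreach { codec =>
      val bytes = fs.getContentSummary(new Path(s"/bench/events_$codec")).getLength
      val t0 = System.nanoTime()
      val rows = spark.read.parquet(s"/bench/events_$codec")
        .where("event_date = DATE'2025-08-01'")
        .count()
      val ms = (System.nanoTime() - t0) / 1e6
      println(f"$codec%-7s ${bytes / 1e6}%10.1f MB   $rows%12d rows   $ms%8.0f ms")
    }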
My rules of thumb:

   - If latency and interactive work matter → Snappy Parquet.
   - If storage $$ dominate and reads are rare → GZIP (or ZSTD as a middle ground).
   - Regardless of codec, partition pruning plus sane file sizes move the needle the most (that's the core of my "Hybrid Curated Storage" approach).

HTH

Regards,
Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

(P.S. The background and examples I referenced are from my article on using GCS external Parquet with Snappy/GZIP/ZSTD and Hive partitioning for cost/perf balance; feel free to skim the compression/export and partitioning sections.)

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Tue, 26 Aug 2025 at 17:59, Nikolas Vanderhoof <nikolasrvanderh...@gmail.com> wrote:

Hello,

Why does Spark use Snappy by default when compressing data within Parquet? I've read that when shuffling, speed is prioritized over compression ratio. Is that true, and are there other things to consider?

Also, are there any recent benchmarks from the community that evaluate Spark's performance with Snappy compared to other codecs? I'd be interested not only in the impact of using other codecs for the intermediate and shuffle files, but also for the storage at rest. I know there are different configuration options that let me set the codec for these internal files and for the final Parquet files stored in the lakehouse.

Before I decide to use a codec other than the default in my work, I want to understand the tradeoffs better.

Thanks,
Nik