[ https://issues.apache.org/jira/browse/SPARK-38703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515907#comment-17515907 ]
Cheng Pan edited comment on SPARK-38703 at 4/1/22 12:55 PM:
------------------------------------------------------------

SPARK-34390 may help; our 1 TB TPC-DS benchmark shows the benefit. (The benchmark uses zstd compression for shuffle, not for Parquet.)

{code:bash}
+-------------------+-----------------------+-----------------------+-----------------+
| codec             | sum(task_cpu_time_s)  | sum(task_run_time_s)  | sum(gc_time_s)  |
+-------------------+-----------------------+-----------------------+-----------------+
| lz4               | 1871242.5             | 3861923.8             | 197151.5        |
| zstd              | 1989641.6             | 3326399.8             | 244333.2        |
| zstd_buffer_pool  | 1912032.0             | 3342339.4             | 187262.3        |
+-------------------+-----------------------+-----------------------+-----------------+
{code}

> High GC and memory footprint after switch to ZSTD
> -------------------------------------------------
>
>                 Key: SPARK-38703
>                 URL: https://issues.apache.org/jira/browse/SPARK-38703
>             Project: Spark
>          Issue Type: Question
>          Components: Input/Output
>    Affects Versions: 3.1.2
>            Reporter: Michael Taranov
>            Priority: Major
>
> Hi All,
> We started to switch our Spark pipelines to read Parquet with ZSTD
> compression.
> After the switch, we see that the memory footprint is much larger than it
> was with SNAPPY.
> Additionally, the GC stats of the jobs are much higher compared to SNAPPY
> with the same workload as before.
> Are there any configurations relevant to the read path that may help
> in such cases?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
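
For readers comparing the rows of the benchmark table in the comment, the relative differences against lz4 can be worked out with a few lines of arithmetic. The following is only a sketch in Python and is not part of the original comment: the figures are copied from the table, while the names `BENCHMARK`, `pct_change`, and `compare` are hypothetical helpers introduced for illustration.

```python
# Figures copied from the benchmark table in the comment
# (seconds, summed over all tasks of the 1T TPC-DS run).
# The dictionary layout and short metric names are assumptions for illustration.
BENCHMARK = {
    "lz4": {"cpu": 1871242.5, "run": 3861923.8, "gc": 197151.5},
    "zstd": {"cpu": 1989641.6, "run": 3326399.8, "gc": 244333.2},
    "zstd_buffer_pool": {"cpu": 1912032.0, "run": 3342339.4, "gc": 187262.3},
}


def pct_change(new, base):
    """Return the change of `new` relative to `base`, in percent."""
    return (new - base) / base * 100.0


def compare(baseline="lz4"):
    """Return per-codec percentage changes for every metric, relative to `baseline`."""
    base = BENCHMARK[baseline]
    return {
        codec: {
            metric: round(pct_change(value, base[metric]), 1)
            for metric, value in metrics.items()
        }
        for codec, metrics in BENCHMARK.items()
        if codec != baseline
    }


if __name__ == "__main__":
    for codec, deltas in compare().items():
        print(codec, deltas)
```

On these numbers, plain zstd lowers total task run time by roughly 14% relative to lz4 but raises GC time by roughly 24%, while the zstd_buffer_pool row keeps most of the run-time gain and brings GC time slightly below the lz4 baseline. As the comment notes, this measures shuffle compression, so it may not carry over directly to the Parquet read path asked about in the issue.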