[ 
https://issues.apache.org/jira/browse/SPARK-38703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515907#comment-17515907
 ] 

Cheng Pan edited comment on SPARK-38703 at 4/1/22 12:55 PM:
------------------------------------------------------------

SPARK-34390 may help; our 1T TPC-DS benchmark shows the benefit.
(The compression here is zstd for shuffle, not Parquet.)

{code:bash}
+-------------------+-----------------------+-----------------------+-----------------+
|       codec       | sum(task_cpu_time_s)  | sum(task_run_time_s)  | sum(gc_time_s)  |
+-------------------+-----------------------+-----------------------+-----------------+
| lz4               | 1871242.5             | 3861923.8             | 197151.5        |
| zstd              | 1989641.6             | 3326399.8             | 244333.2        |
| zstd_buffer_pool  | 1912032.0             | 3342339.4             | 187262.3        |
+-------------------+-----------------------+-----------------------+-----------------+
{code}
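
For reference, the relative differences between the rows can be worked out with a short sketch like the one below. The numbers are copied from the table; the dictionary layout and the helper names (pct_change, compare) are illustrative only and are not part of the benchmark tooling.

```python
# Totals copied from the benchmark table above (seconds, summed over all tasks).
results = {
    "lz4": {"cpu": 1871242.5, "run": 3861923.8, "gc": 197151.5},
    "zstd": {"cpu": 1989641.6, "run": 3326399.8, "gc": 244333.2},
    "zstd_buffer_pool": {"cpu": 1912032.0, "run": 3342339.4, "gc": 187262.3},
}


def pct_change(new, old):
    """Relative change from old to new, in percent (negative means lower)."""
    return (new - old) / old * 100.0


def compare(candidate, baseline):
    """Percent change of every metric for candidate relative to baseline."""
    return {
        metric: pct_change(results[candidate][metric], results[baseline][metric])
        for metric in results[baseline]
    }


if __name__ == "__main__":
    # Compare the buffer-pool run against both other configurations.
    for baseline in ("lz4", "zstd"):
        deltas = compare("zstd_buffer_pool", baseline)
        print(baseline, {m: round(v, 1) for m, v in deltas.items()})
```

On these totals, the buffer pool run spends roughly 23% less time in GC than plain zstd and about 5% less than lz4, while its summed task run time stays about 13% below lz4.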






> High GC and memory footprint after switch to ZSTD
> -------------------------------------------------
>
>                 Key: SPARK-38703
>                 URL: https://issues.apache.org/jira/browse/SPARK-38703
>             Project: Spark
>          Issue Type: Question
>          Components: Input/Output
>    Affects Versions: 3.1.2
>            Reporter: Michael Taranov
>            Priority: Major
>
> Hi All,
> We have started switching our Spark pipelines to read Parquet with ZSTD 
> compression. 
> After the switch, the memory footprint is much larger than it was 
> with SNAPPY.
> Additionally, the GC stats of the jobs are much higher than with SNAPPY 
> for the same workload. 
> Are there any configurations relevant to the read path that may help 
> in such cases?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
