[jira] [Commented] (SPARK-4633) Support gzip in spark.compression.io.codec

2016-09-12 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484488#comment-15484488
 ] 

Adam Roberts commented on SPARK-4633:
-------------------------------------

Very interested in this, and I know Nasser Ebrahim is too (full disclosure:
we both work for IBM).

https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/ shows
promising results.

It would be interesting to code up a quick prototype (perhaps based on the pull
request here) and see what performance difference we can gain; it looks like
Takeshi has done the starting work for us.
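
For reference, a minimal sketch of what such a prototype could look like,
assuming the JDK gzip streams and Spark's internal CompressionCodec trait; the
class name and the consolidated block-size key are placeholders taken from this
issue, not existing Spark code:

    package org.apache.spark.io

    import java.io.{InputStream, OutputStream}
    import java.util.zip.{GZIPInputStream, GZIPOutputStream}

    import org.apache.spark.SparkConf

    // Sketch only: mirrors how the built-in codecs (LZF, Snappy, LZ4)
    // implement the private[spark] CompressionCodec trait, hence the
    // org.apache.spark.io package declaration.
    class GzipCompressionCodec(conf: SparkConf) extends CompressionCodec {

      // Single consolidated block-size key proposed in this issue (not an
      // existing Spark option); 32768 matches the usual codec default.
      private val blockSize =
        conf.getInt("spark.io.compression.block.size", 32768)

      override def compressedOutputStream(s: OutputStream): OutputStream =
        new GZIPOutputStream(s, blockSize)

      override def compressedInputStream(s: InputStream): InputStream =
        new GZIPInputStream(s, blockSize)
    }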

> Support gzip in spark.compression.io.codec
> ------------------------------------------
>
> Key: SPARK-4633
> URL: https://issues.apache.org/jira/browse/SPARK-4633
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> gzip is widely used in other frameworks such as Hadoop MapReduce and Tez, and
> I think that gzip is more stable than other codecs in terms of both
> performance and space overheads.
> I have one open question: the current Spark configuration has a block size
> option for each codec (spark.io.compression.[gzip|lz4|snappy].block.size).
> As the number of codecs grows, the configuration gains more options, which
> I think is somewhat complicated for non-expert users.
> To mitigate this, my thought is as follows: the three options are replaced
> with a single block size option (spark.io.compression.block.size). Then,
> 'Meaning' in the configuration table would read "This option affects gzip,
> lz4, and snappy. Block size (in bytes) used in compression, in the case when
> these compression codecs are used. Lowering...".
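
To make the quoted proposal concrete, here is what user-facing configuration
could look like under that consolidation; both the "gzip" codec value and the
shared block-size key are hypothetical, taken from this issue rather than from
an existing Spark release:

    import org.apache.spark.SparkConf

    // Hypothetical configuration: "gzip" as a codec value and the shared
    // "spark.io.compression.block.size" key both come from this proposal.
    val conf = new SparkConf()
      .set("spark.io.compression.codec", "gzip")
      .set("spark.io.compression.block.size", "32768") // replaces the three per-codec keys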



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4633) Support gzip in spark.compression.io.codec

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227133#comment-14227133
 ] 

Apache Spark commented on SPARK-4633:
-------------------------------------

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3488
