[ https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939864#comment-16939864 ]
Nicholas Chammas commented on SPARK-29280:
------------------------------------------

cc [~hyukjin.kwon], [~cloud_fan]

> DataFrameReader should support a compression option
> ---------------------------------------------------
>
>                 Key: SPARK-29280
>                 URL: https://issues.apache.org/jira/browse/SPARK-29280
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter] supports a {{compression}} option, but [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader] doesn't. The lack of a {{compression}} option in the reader causes friction in the following cases:
> # You want to read data compressed with a codec that Spark does not [load by default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
> # You want to read data with a codec that overrides one of the built-in codecs that Spark supports.
> # You want to explicitly instruct Spark on which codec to use on read when it will not be able to correctly auto-detect it (e.g. because the file extension is [missing|https://stackoverflow.com/q/52011697/877069], [non-standard|https://stackoverflow.com/q/44372995/877069], or [incorrect|https://stackoverflow.com/q/49110384/877069]).
> Case #2 came up in SPARK-29102. There is a very handy library called [SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you load a single gzipped file using multiple concurrent tasks. (You can see the details of how it works and why it's useful in the project README and in SPARK-29102.)
> To use this codec, I had to set {{io.compression.codecs}}.
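> For reference, the workaround today is to forward the Hadoop property through Spark's configuration: properties prefixed with {{spark.hadoop.}} are copied into the underlying Hadoop {{Configuration}}. A minimal sketch of that workaround (the codec class name is taken from the SplittableGzip README; the input path is hypothetical):
> {code:java}
> from pyspark.sql import SparkSession
>
> # The spark.hadoop. prefix forwards io.compression.codecs to the
> # Hadoop Configuration, registering the SplittableGzip codec so that
> # .gz files are read with it instead of the built-in GzipCodec.
> spark = (
>     SparkSession.builder
>     .config("spark.hadoop.io.compression.codecs",
>             "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
>     .getOrCreate()
> )
>
> df = spark.read.csv("logs.csv.gz")  # hypothetical input path
> {code}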
> I guess this is a Hadoop filesystem API setting, since it [doesn't appear to be documented by Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, there is also a setting called {{spark.io.compression.codec}}, which seems to serve a different purpose.
> It would be much clearer for the user, and more consistent with the writer interface, if the reader let you directly specify the codec. For example:
> {code:java}
> spark.read.option('compression', 'lz4').csv(...)
> spark.read.csv(..., compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)