It worked, thanks. This doc page <https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends using "spark.sql.parquet.compression.codec" to set the compression codec, and I thought this setting would be forwarded to the hive context given that HiveContext extends SQLContext, but it was not.
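For the archives, here is the minimal change that made it work in the spark shell (the rest of my original snippet below stays the same); the key point, per Michael's reply, is to set the Hive-level property instead of the Spark SQL one:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)

// the Hive write path reads "parquet.compression", not
// "spark.sql.parquet.compression.codec"
hiveContext.sql("SET parquet.compression=GZIP")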
I am wondering if this behavior is normal; if not, I could open an issue with a potential fix so that "spark.sql.parquet.compression.codec" would be translated to "parquet.compression" in the hive context. Or the documentation should be updated to mention that the compression codec is set differently with HiveContext. (A rough sketch of that translation is below the quoted thread.)

Ayoub.

2015-01-09 17:51 GMT+01:00 Michael Armbrust <mich...@databricks.com>:

> This is a little confusing, but that code path is actually going through
> hive. So the spark sql configuration does not help.
>
> Perhaps, try:
> set parquet.compression=GZIP;
>
> On Fri, Jan 9, 2015 at 2:41 AM, Ayoub <benali.ayoub.i...@gmail.com> wrote:
>
>> Hello,
>>
>> I tried to save a table created via the hive context as a parquet file,
>> but whatever compression codec (uncompressed, snappy, gzip or lzo) I set
>> via setConf like:
>>
>> setConf("spark.sql.parquet.compression.codec", "gzip")
>>
>> the size of the generated files is always the same, so it seems like the
>> hive context ignores the compression codec that I set.
>>
>> Here is a code sample applied via the spark shell:
>>
>> import org.apache.spark.sql.hive.HiveContext
>> val hiveContext = new HiveContext(sc)
>>
>> hiveContext.sql("SET hive.exec.dynamic.partition = true")
>> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
>> // required to make data compatible with impala
>> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
>> hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
>>
>> hiveContext.sql("create external table if not exists foo (bar STRING, ts
>> INT) PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET
>> LOCATION 'hdfs://path/data/foo'")
>>
>> hiveContext.sql("insert into table foo partition(year, month, day) select
>> *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month,
>> day(from_unixtime(ts)) as day from raw_foo")
>>
>> I tried that with spark 1.2 and a 1.3 snapshot against hive 0.13,
>> and I also tried Impala on the same cluster, which correctly applied
>> the compression codecs.
>>
>> Does anyone know what the problem could be?
>>
>> Thanks,
>> Ayoub.
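P.S. To make the proposal above concrete, here is a rough sketch of the translation I have in mind; the helper name is hypothetical, not actual Spark internals:

import org.apache.spark.sql.hive.HiveContext

// hypothetical helper: forward the Spark SQL setting to the Hive-level
// key before writing, so the Hive code path picks it up
def forwardParquetCodec(hiveContext: HiveContext): Unit = {
  val codec = hiveContext.getConf("spark.sql.parquet.compression.codec", "uncompressed")
  hiveContext.sql(s"SET parquet.compression=${codec.toUpperCase}")
}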