Hello, 

I tried to save a table created via the HiveContext as a Parquet file, but
whichever compression codec I set via setConf (uncompressed, snappy, gzip
or lzo), for example:

setConf("spark.sql.parquet.compression.codec", "gzip") 

the size of the generated files is always the same, so it seems that Spark
ignores the compression codec I set.

Here is a code sample, run in the Spark shell:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)

hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
// required to make the data compatible with Impala
hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS foo (bar STRING, ts INT)
  PARTITIONED BY (year INT, month INT, day INT)
  STORED AS PARQUET
  LOCATION 'hdfs://path/data/foo'
""")

hiveContext.sql("""
  INSERT INTO TABLE foo PARTITION (year, month, day)
  SELECT *,
         year(from_unixtime(ts)) AS year,
         month(from_unixtime(ts)) AS month,
         day(from_unixtime(ts)) AS day
  FROM raw_foo
""")
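
One thing I suspect (but have not verified) is that the INSERT above goes
through Hive's Parquet writer rather than Spark SQL's native one, so
spark.sql.parquet.compression.codec may never be consulted on that path. If
so, Hive's own session property might control it instead; a sketch of what
I mean (I am not certain this is the right property for this code path):

// Hypothetical workaround: set Hive's Parquet codec directly in the
// session before running the INSERT, in case the Hive write path reads
// this property instead of spark.sql.parquet.compression.codec.
hiveContext.sql("SET parquet.compression=GZIP")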

I tried this with Spark 1.2 and a 1.3 snapshot against Hive 0.13. I also
tried the same thing with Impala on the same cluster, and it applied the
compression codecs correctly.
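
To narrow things down, I was also going to write the same data through
Spark SQL's native Parquet writer, which I believe does honor
spark.sql.parquet.compression.codec. A minimal sketch (the output path is
a placeholder):

// Sketch: bypass the Hive write path entirely; in Spark 1.2,
// hiveContext.sql returns a SchemaRDD, whose saveAsParquetFile should
// apply spark.sql.parquet.compression.codec when writing.
val result = hiveContext.sql("SELECT bar, ts FROM raw_foo")
result.saveAsParquetFile("hdfs://path/data/foo_native") // placeholder path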

Does anyone know what the problem could be?

Thanks, 
Ayoub.



