Hi Ayoub,
The doc page isn’t wrong, but it’s indeed confusing.
|spark.sql.parquet.compression.codec| is used when you're writing Parquet
files with something like |data.saveAsParquetFile(...)|. However, you are
using Hive DDL in the example code. All Hive DDL statements and commands like
|SET| are delegated directly to Hive, which unfortunately ignores Spark
configurations. That said, the documentation should be updated.
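For example, a minimal sketch of the difference (reusing the raw_foo/foo
tables and path from your snippet; the foo_spark path below is just made up
for illustration):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Honoured: Spark SQL's own Parquet writer reads the Spark setting.
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
hiveContext.sql("select * from raw_foo").saveAsParquetFile("hdfs://path/data/foo_spark")

// Not honoured: Hive DDL/DML only looks at Hive/Parquet properties,
// so set parquet.compression instead before the INSERT.
hiveContext.sql("SET parquet.compression=GZIP")
hiveContext.sql("insert into table foo partition(year, month, day) select *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo")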
Best,
Cheng
On 1/10/15 5:49 AM, Ayoub Benali wrote:
It worked, thanks.
This doc page
<https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends
using "spark.sql.parquet.compression.codec" to set the compression
codec, and I thought this setting would be forwarded to the Hive
context given that HiveContext extends SQLContext, but it was not.
I am wondering if this behavior is normal; if not, I could open an
issue with a potential fix so that
"spark.sql.parquet.compression.codec" would be translated to
"parquet.compression" in the Hive context.
Or the documentation should be updated to mention that the compression
codec is set differently with HiveContext.
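For example, a hand-rolled version of that translation in the shell could
look roughly like this (just a sketch, assuming SQLContext's getConf with a
default value is available):

// Hypothetical workaround: forward the Spark setting to the Hive property by hand.
val codec = hiveContext.getConf("spark.sql.parquet.compression.codec", "gzip")
hiveContext.sql("SET parquet.compression=" + codec.toUpperCase)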
Ayoub.
2015-01-09 17:51 GMT+01:00 Michael Armbrust <mich...@databricks.com>:
This is a little confusing, but that code path is actually going
through Hive, so the Spark SQL configuration does not help.
Perhaps try:
set parquet.compression=GZIP;
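From the spark shell, that would just be another statement issued through the
same context, e.g.:

// Sketch: set the Hive/Parquet property through the HiveContext before the INSERT.
hiveContext.sql("SET parquet.compression=GZIP")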
On Fri, Jan 9, 2015 at 2:41 AM, Ayoub <benali.ayoub.i...@gmail.com> wrote:
Hello,
I tried to save a table created via the Hive context as a Parquet file,
but whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
setConf, like:

setConf("spark.sql.parquet.compression.codec", "gzip")

the size of the generated files is always the same, so it seems like
the Spark context ignores the compression codec that I set.
Here is a code sample applied via the spark shell:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required to make data compatible with impala
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

hiveContext.sql("create external table if not exists foo (bar STRING, ts INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET Location 'hdfs://path/data/foo'")

hiveContext.sql("insert into table foo partition(year, month, day) select *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo")
I tried that with Spark 1.2 and a 1.3 snapshot against Hive 0.13,
and I also tried it with Impala on the same cluster, which correctly
applied the compression codecs.

Does anyone know what the problem could be?
Thanks,
Ayoub.