Hi Ayoub,

The doc page isn’t wrong, but it’s indeed confusing. |spark.sql.parquet.compression.codec| is used when you’re writing Parquet files with something like |data.saveAsParquetFile(...)|. However, you are using Hive DDL in the example code. All Hive DDL statements and commands like |SET| are delegated directly to Hive, which unfortunately ignores Spark configurations. Still, the doc page should be updated.
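
For example, a rough sketch of the two code paths (untested, against the 1.2-era API; the table names are borrowed from your snippet below and the output path is made up):

// Path 1: writing Parquet directly through Spark SQL honors the Spark setting
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
hiveContext.sql("select * from raw_foo").saveAsParquetFile("hdfs://path/data/foo_native")

// Path 2: an INSERT into the Hive-created table goes through Hive, which
// only reads the Hive-side property, so set it there before the INSERT
hiveContext.sql("SET parquet.compression=GZIP")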

Best,
Cheng

On 1/10/15 5:49 AM, Ayoub Benali wrote:

It worked, thanks.

This doc page <https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends using "spark.sql.parquet.compression.codec" to set the compression codec. I thought this setting would be forwarded to the Hive context, given that HiveContext extends SQLContext, but it was not.

I am wondering whether this behavior is intended. If not, I could open an issue with a potential fix so that "spark.sql.parquet.compression.codec" is translated to "parquet.compression" in the Hive context.

Otherwise, the documentation should be updated to mention that the compression codec is set differently with HiveContext.

Ayoub.



2015-01-09 17:51 GMT+01:00 Michael Armbrust <mich...@databricks.com>:

    This is a little confusing, but that code path is actually going
    through Hive, so the Spark SQL configuration does not help.

    Perhaps try:
    set parquet.compression=GZIP;
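
    From the Spark shell, that would presumably be something like this
    (untested sketch, using the same hiveContext as in the code below):

    hiveContext.sql("SET parquet.compression=GZIP")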

    On Fri, Jan 9, 2015 at 2:41 AM, Ayoub <benali.ayoub.i...@gmail.com> wrote:

        Hello,

        I tried to save a table created via the Hive context as a Parquet
        file, but whatever compression codec (uncompressed, snappy, gzip or
        lzo) I set via setConf, like:

        setConf("spark.sql.parquet.compression.codec", "gzip")

        the size of the generated files is always the same, so it seems like
        the Spark context ignores the compression codec that I set.

        Here is a code sample run in the Spark shell:

        import org.apache.spark.sql.hive.HiveContext
        val hiveContext = new HiveContext(sc)

        hiveContext.sql("SET hive.exec.dynamic.partition = true")
        hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
        hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required to make data compatible with Impala
        hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

        hiveContext.sql("create external table if not exists foo (bar STRING, ts INT) partitioned by (year INT, month INT, day INT) STORED AS PARQUET location 'hdfs://path/data/foo'")

        hiveContext.sql("insert into table foo partition(year, month, day) select *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo")

        I tried this with Spark 1.2 and a 1.3 snapshot against Hive 0.13.
        I also tried it with Impala on the same cluster, which applied the
        compression codecs correctly.

        Does anyone know what the problem could be?

        Thanks,
        Ayoub.







