Hi Ayoub,

The doc page isn’t wrong, but it’s indeed confusing. |spark.sql.parquet.compression.codec| is used when you’re writing Parquet files with something like |data.saveAsParquetFile(...)|. However, you are using Hive DDL in the example code. All Hive DDL statements and commands like |SET| are delegated directly to Hive, which unfortunately ignores Spark configurations. Still, the doc page should be updated.
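
For example, a rough sketch of the two code paths (untested, against the 1.2-era API; the table names are borrowed from your snippet below and the output path is made up):

// Path 1: writing Parquet directly through Spark SQL honors the Spark setting
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
hiveContext.sql("select * from raw_foo").saveAsParquetFile("hdfs://path/data/foo_native")

// Path 2: an INSERT into the Hive-created table goes through Hive, which
// only reads the Hive-side property, so set it there before the INSERT
hiveContext.sql("SET parquet.compression=GZIP")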

Best,
Cheng

On 1/10/15 5:49 AM, Ayoub Benali wrote:

It worked, thanks.

This doc page <https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends using "spark.sql.parquet.compression.codec" to set the compression codec. I thought this setting would be forwarded to the Hive context, given that HiveContext extends SQLContext, but it was not.

I am wondering whether this behavior is intended. If not, I could open an issue with a potential fix so that "spark.sql.parquet.compression.codec" is translated to "parquet.compression" in the Hive context.

Otherwise, the documentation should be updated to mention that the compression codec is set differently with HiveContext.

Ayoub.



2015-01-09 17:51 GMT+01:00 Michael Armbrust <mich...@databricks.com>:

    This is a little confusing, but that code path is actually going
    through Hive, so the Spark SQL configuration does not help.

    Perhaps try:
    set parquet.compression=GZIP;
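
    From the Spark shell, that would presumably be something like this
    (untested sketch, using the same hiveContext as in the code below):

    hiveContext.sql("SET parquet.compression=GZIP")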

    On Fri, Jan 9, 2015 at 2:41 AM, Ayoub <benali.ayoub.i...@gmail.com> wrote:

        Hello,

        I tried to save a table created via the Hive context as a Parquet
        file, but whatever compression codec (uncompressed, snappy, gzip or
        lzo) I set via setConf, like:

        setConf("spark.sql.parquet.compression.codec", "gzip")

        the size of the generated files is always the same, so it seems like
        the Spark context ignores the compression codec that I set.

        Here is a code sample run in the Spark shell:

        import org.apache.spark.sql.hive.HiveContext
        val hiveContext = new HiveContext(sc)

        hiveContext.sql("SET hive.exec.dynamic.partition = true")
        hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
        hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required to make data compatible with Impala
        hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

        hiveContext.sql("create external table if not exists foo (bar STRING, ts INT) partitioned by (year INT, month INT, day INT) STORED AS PARQUET location 'hdfs://path/data/foo'")

        hiveContext.sql("insert into table foo partition(year, month, day) select *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo")

        I tried this with Spark 1.2 and a 1.3 snapshot against Hive 0.13.
        I also tried it with Impala on the same cluster, which applied the
        compression codecs correctly.

        Does anyone know what the problem could be?

        Thanks,
        Ayoub.







