Thanks for the information. Verified that the _common_metadata and _metadata files are indeed missing in this case when using Hadoop 1.0.4. Would you mind opening a JIRA for this?

Cheng

On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
I'm using 1.0.4

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

    Hm, which version of Hadoop are you using? Actually there should
    also be a _metadata file together with _common_metadata. I was
    using Hadoop 2.4.1 btw. I'm not sure whether Hadoop version
    matters here, but I did observe cases where Spark behaves
    differently because of semantic differences of the same API in
    different Hadoop versions.

    Cheng

    On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
    Hi Cheng,

    on my computer, execute res0.save("xxx",
    org.apache.spark.sql.SaveMode.Overwrite) produces:

    peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
    total 32
    -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00001.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00002.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00003.parquet*
    -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-00004.parquet*

    while res0.save("xxx") produces:

    peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
    total 40
    -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
    -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00001.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00002.parquet*
    -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00003.parquet*
    -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-00004.parquet*

    On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

        I couldn’t reproduce this with the following spark-shell snippet:

        scala> import sqlContext.implicits._
        scala> Seq((1, 2)).toDF("a", "b")
        scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
        scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

        The _common_metadata file is typically much smaller than
        _metadata, because it doesn’t contain row group information,
        and thus can be faster to read than _metadata.

        Cheng

        On 3/26/15 12:48 PM, Pei-Lun Lee wrote:

        Hi,

        When I save a parquet file with SaveMode.Overwrite, it never
        generates _common_metadata, whether or not it overwrites an
        existing dir.
        Is this expected behavior?
        And what is the benefit of _common_metadata? Does reading
        perform better when it is present?

        Thanks,
        --
        Pei-Lun


