Hm, which version of Hadoop are you using? Actually, there should also be a _metadata file alongside _common_metadata. I was using Hadoop 2.4.1, by the way. I'm not sure whether the Hadoop version matters here, but I have observed cases where Spark behaves differently because the same API has different semantics in different Hadoop versions.

Cheng

On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
Hi Cheng,

On my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 32
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00001.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00002.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00003.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-00004.parquet*

while res0.save("xxx") produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 40
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00001.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00002.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-00003.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-00004.parquet*
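
The comparison above can also be scripted rather than read off `ls`. The following is only a minimal sketch that could be pasted into the same spark-shell session; it assumes the output directory is on the local filesystem (as in the listings above, not HDFS), and the helper name `summaryFilesPresent` is made up for illustration:

```scala
import java.nio.file.{Files, Paths}

// Parquet summary files that may sit next to the part files.
val summaryFiles = Seq("_metadata", "_common_metadata")

// Hypothetical helper: for each summary file name, report whether it
// exists directly under `dir` (local filesystem only).
def summaryFilesPresent(dir: String): Map[String, Boolean] =
  summaryFiles.map(name => name -> Files.exists(Paths.get(dir, name))).toMap

// "xxx" is the output directory used in the save calls above.
summaryFilesPresent("xxx").foreach { case (name, present) =>
  println(s"$name present: $present")
}
```

Running it once after each variant of save makes the difference between the two modes easy to spot.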

On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

    I couldn’t reproduce this with the following spark-shell snippet:

    scala> import sqlContext.implicits._
    scala> Seq((1, 2)).toDF("a", "b")
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

    The _common_metadata file is typically much smaller than
    _metadata, because it doesn’t contain row group information, and
    thus can be faster to read than _metadata.

    Cheng

    On 3/26/15 12:48 PM, Pei-Lun Lee wrote:

    Hi,

    When I save a Parquet file with SaveMode.Overwrite, it never
    generates _common_metadata, whether or not it overwrites an
    existing directory.
    Is this expected behavior?
    And what is the benefit of _common_metadata? Will reads
    perform better when it is present?

    Thanks,
    --
    Pei-Lun
