JIRA ticket created at: https://issues.apache.org/jira/browse/SPARK-6581
Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Thanks for the information. Verified that the _common_metadata and
> _metadata files are missing in this case when using Hadoop 1.0.4. Would you
> mind opening a JIRA for this?
>
> Cheng
>
> On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
>
> I'm using 1.0.4
>
> Thanks,
> --
> Pei-Lun
>
> On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hm, which version of Hadoop are you using? Actually there should also
>> be a _metadata file together with _common_metadata. I was using Hadoop
>> 2.4.1, btw. I'm not sure whether the Hadoop version matters here, but I did
>> observe cases where Spark behaves differently because of semantic
>> differences of the same API in different Hadoop versions.
>>
>> Cheng
>>
>> On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
>>
>> Hi Cheng,
>>
>> on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:
>>
>> peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
>> total 32
>> -rwxrwxrwx 1 peilunlee staff   0 Mar 27 11:29 _SUCCESS*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00001.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00002.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00003.parquet*
>> -rwxrwxrwx 1 peilunlee staff 488 Mar 27 11:29 part-r-00004.parquet*
>>
>> while res0.save("xxx") produces:
>>
>> peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
>> total 40
>> -rwxrwxrwx 1 peilunlee staff   0 Mar 27 11:29 _SUCCESS*
>> -rwxrwxrwx 1 peilunlee staff 250 Mar 27 11:29 _common_metadata*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00001.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00002.parquet*
>> -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29 part-r-00003.parquet*
>> -rwxrwxrwx 1 peilunlee staff 488 Mar 27 11:29 part-r-00004.parquet*
>>
>> On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> I couldn't reproduce this with the following spark-shell snippet:
>>>
>>> scala> import sqlContext.implicits._
>>> scala> Seq((1, 2)).toDF("a", "b")
>>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>>>
>>> The _common_metadata file is typically much smaller than _metadata,
>>> because it doesn't contain row group information, and thus can be faster to
>>> read than _metadata.
>>>
>>> Cheng
>>>
>>> On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
>>>
>>> Hi,
>>>
>>> When I save a parquet file with SaveMode.Overwrite, it never generates
>>> _common_metadata, whether it overwrites an existing dir or not.
>>> Is this expected behavior?
>>> And what is the benefit of _common_metadata? Will reading perform
>>> better when it is present?
>>>
>>> Thanks,
>>> --
>>> Pei-Lun
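For anyone comparing their own output directories against the listings in this thread, here is a minimal plain-Scala sketch (no Spark required; the object and method names are hypothetical, not part of any Spark API) that checks which Parquet summary files are present in a directory listing:

```scala
// Hypothetical helper: given the file names in a Parquet output directory,
// report which summary files (_metadata, _common_metadata) are present.
object SummaryCheck {
  private val summaryNames = Set("_metadata", "_common_metadata")

  // Returns the summary files found in the listing, sorted by name.
  def summaryFiles(listing: Seq[String]): Seq[String] =
    listing.filter(summaryNames.contains).sorted

  def main(args: Array[String]): Unit = {
    // Listing shape from the SaveMode.Overwrite case above: no summary files.
    val overwrite = Seq("_SUCCESS", "part-r-00001.parquet", "part-r-00002.parquet")
    // Listing shape from the plain save() case: _common_metadata is present.
    val plainSave = Seq("_SUCCESS", "_common_metadata", "part-r-00001.parquet")
    println(summaryFiles(overwrite))
    println(summaryFiles(plainSave))
  }
}
```

An empty result for the Overwrite listing and `List(_common_metadata)` for the plain-save listing reproduces the difference Pei-Lun observed.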